Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the bias in first-order gradient estimation arising from discontinuous dynamics in policy gradient reinforcement learning, as well as limitations of existing approaches that rely on high-variance REINFORCE estimators, require task-specific hyperparameter tuning, and suffer from poor sample efficiency. To overcome these challenges, the paper introduces two novel methods: DDCG, a lightweight estimator-switching mechanism that achieves robust and sample-efficient learning in discontinuous tasks with only a single hyperparameter; and IVW-H, a variance-stabilizing technique for differentiable robot control that fuses gradient estimates via inverse-variance weighting without requiring explicit discontinuity detection. Empirical results demonstrate that DDCG exhibits strong robustness on standard discontinuous benchmarks, while IVW-H substantially improves performance, highlighting that careful variance control is often more critical in practice than bias correction.
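The inverse-variance weighting idea mentioned in the summary can be illustrated with a small sketch. This is not the paper's IVW-H implementation; it is a generic fusion of a 0th-order and a 1st-order gradient estimate, assuming per-sample gradients are available as NumPy arrays. The function name inverse_variance_fuse, the eps guard, and the choice of one scalar variance per estimator are assumptions made for illustration.

```python
import numpy as np

def inverse_variance_fuse(g0_samples, g1_samples, eps=1e-12):
    """Fuse two gradient estimators by inverse-variance weighting.

    g0_samples: per-sample 0th-order gradients, shape (n0,) or (n0, d)
    g1_samples: per-sample 1st-order gradients, shape (n1,) or (n1, d)
    Returns (fused_gradient, weight_on_first_order).
    Hypothetical helper for illustration, not the paper's IVW-H.
    """
    g0 = np.asarray(g0_samples, dtype=float)
    g1 = np.asarray(g1_samples, dtype=float)
    n0, n1 = g0.shape[0], g1.shape[0]

    mean0 = g0.mean(axis=0)
    mean1 = g1.mean(axis=0)

    # Variance of each sample mean, summed over parameter dimensions
    # so that each estimator gets a single scalar weight.
    var0 = g0.var(axis=0, ddof=1).sum() / n0
    var1 = g1.var(axis=0, ddof=1).sum() / n1

    # Inverse-variance weight on the 1st-order estimate:
    # (1/var1) / (1/var0 + 1/var1) == var0 / (var0 + var1).
    w1 = var0 / (var0 + var1 + eps)
    fused = w1 * mean1 + (1.0 - w1) * mean0
    return fused, w1

if __name__ == "__main__":
    fused, w1 = inverse_variance_fuse([0.0, 4.0], [0.0, 2.0])
    print(fused, w1)
```

Under this weighting, whichever estimator has the lower empirical variance dominates the fused gradient, so no explicit discontinuity test is needed in the sketch. Note that the weights account only for variance and do nothing to correct bias in either input.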

📝 Abstract

In policy gradient reinforcement learning, access to a differentiable model enables 1st-order gradient estimation that accelerates learning compared to relying solely on derivative-free 0th-order estimators. However, discontinuous dynamics cause bias and undermine the effectiveness of 1st-order estimators. Prior work addressed this bias by constructing a confidence interval around the REINFORCE 0th-order gradient estimator and using these bounds to detect discontinuities. However, the REINFORCE estimator is notoriously noisy, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the primary obstacle and what minimal fixes suffice. First, we re-examine standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, we present IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that while estimator switching improves robustness in controlled studies, careful variance control often dominates in practical deployments.
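To make the bias claim concrete, here is a toy sketch that is not taken from the paper's benchmarks. It assumes a scalar objective J(theta) = E[f(theta + w)] with Gaussian noise w and f the Heaviside step function. The pathwise (1st-order) estimator averages f', which is zero almost everywhere, while the REINFORCE-style (0th-order) estimator averages f(theta + w) * w / sigma^2. The function name step_gradient_estimates and all parameter values are hypothetical.

```python
import numpy as np

def step_gradient_estimates(theta, sigma=0.5, n=100_000, seed=0):
    """Compare 1st- and 0th-order gradient estimates on a step function.

    Objective: J(theta) = E_{w ~ N(0, sigma^2)}[ 1{theta + w > 0} ].
    Returns (first_order, zeroth_order, true_gradient).
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, sigma, size=n)

    # 1st-order (pathwise): the derivative of the step is 0 at almost
    # every sample, so the estimate is 0 regardless of theta.
    first_order = float(np.zeros_like(w).mean())

    # 0th-order (REINFORCE-style, Gaussian score function): unbiased,
    # but noisier because it never uses derivative information.
    f = (theta + w > 0).astype(float)
    zeroth_order = float(np.mean(f * w / sigma**2))

    # Closed form: dJ/dtheta is the N(0, sigma^2) density at -theta.
    true_gradient = float(
        np.exp(-theta**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    )
    return first_order, zeroth_order, true_gradient

if __name__ == "__main__":
    print(step_gradient_estimates(0.0))
```

In this toy setting the 1st-order estimate has zero variance but misses the entire gradient, which is the kind of failure that motivates switching or reweighting estimators near discontinuities.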

Problem

Research questions and friction points this paper is trying to address.

policy gradients

differentiable simulators

discontinuous dynamics

gradient estimation bias

sample efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

differentiable simulators

policy gradients

discontinuous dynamics