🤖 AI Summary
To address inefficient training in reinforcement learning when simulator gradients are unavailable, this paper proposes a novel architecture that decouples trajectory sampling from gradient computation: trajectories are collected with the true (non-differentiable) simulator, while policy gradients are estimated by backpropagation through a learned differentiable dynamics model. This enables first-order policy optimization without requiring simulator derivatives. The approach avoids the performance degradation caused by the error accumulation inherent in conventional model-based RL (MBRL), while supporting high-fidelity value function learning. It is compatible with general-purpose algorithms such as PPO and inherits the efficient optimization properties of SHAC. Experiments show that the method matches the sample efficiency and training speed of SHAC on standard continuous-control benchmarks, significantly outperforms classical MBRL baselines, and achieves stable quadrupedal and bipedal locomotion on the Go2 quadruped robot.
📝 Abstract
There is growing interest in reinforcement learning (RL) methods that leverage the simulator's derivatives to improve learning efficiency. While early gradient-based approaches have demonstrated superior performance compared to derivative-free methods, accessing simulator gradients is often impractical due to their implementation cost or unavailability. Model-based RL (MBRL) can approximate these gradients via learned dynamics models, but rollouts through a learned model suffer from compounding prediction errors during training, which can degrade policy performance. We propose an approach that decouples trajectory generation from gradient computation: trajectories are unrolled using the simulator, while gradients are computed via backpropagation through a learned differentiable model of the simulator. This hybrid design enables efficient and consistent first-order policy optimization even when simulator gradients are unavailable, and allows the critic to be learned from simulator rollouts, making it more accurate. Our method achieves the sample efficiency and speed of specialized optimizers such as SHAC, while maintaining the generality of standard approaches like PPO and avoiding the pathological behaviors observed in other first-order MBRL methods. We empirically validate our algorithm on benchmark control tasks and demonstrate its effectiveness on a real Go2 quadruped robot, across both quadrupedal and bipedal locomotion tasks.
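To make the decoupling concrete, here is a minimal toy sketch of the idea (not the paper's implementation; all names, the linear dynamics, and the cost are illustrative assumptions): states are rolled out with a "black-box" true simulator that we never differentiate, while the policy gradient is computed by propagating sensitivities through a slightly inaccurate learned linear model evaluated along that same trajectory.

```python
def true_sim_step(s, u):
    # "Black box" true dynamics: we only query values, never derivatives.
    return 0.9 * s + 0.5 * u

# Learned differentiable model s' ~= A_HAT*s + B_HAT*u (slightly off on purpose).
A_HAT, B_HAT = 0.88, 0.52

def policy(theta, s):
    # Linear state-feedback policy u = theta * s.
    return theta * s

def rollout_and_grad(theta, s0, horizon=20):
    # 1) Sample the trajectory with the TRUE simulator.
    states = [s0]
    for _ in range(horizon):
        states.append(true_sim_step(states[-1], policy(theta, states[-1])))
    # 2) Backprop the cost sum_t s_t^2 through the LEARNED model:
    #    ds_{t+1}/dtheta = (A_HAT + B_HAT*theta) * ds_t/dtheta + B_HAT * s_t.
    loss, ds_dtheta, grad = 0.0, 0.0, 0.0
    for s in states:
        loss += s * s
        grad += 2.0 * s * ds_dtheta            # dL/dtheta contribution at step t
        ds_dtheta = (A_HAT + B_HAT * theta) * ds_dtheta + B_HAT * s
    return loss, grad

# First-order policy optimization using the hybrid gradient.
theta, s0, losses = 0.0, 1.0, []
for _ in range(200):
    loss, g = rollout_and_grad(theta, s0)
    losses.append(loss)
    theta -= 0.05 * g
```

Because the states come from the true simulator, the learned model is only queried for derivatives along real trajectories, so model error does not compound over the rollout; here the slightly wrong `A_HAT`, `B_HAT` still drive `theta` toward a stabilizing gain.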