🤖 AI Summary
In high-dimensional humanoid locomotion control (e.g., the 26-DoF Unitree H1-2), reinforcement learning faces challenges including dynamical instability, complex contact interactions, and the policy mismatch and convergence difficulties that arise from distributional shift in off-policy training. To address these, we propose TD-GRPC (Temporal-Difference Group Relative Policy Constraint): the first integration of Group Relative Policy Optimization (GRPO) and explicit Policy Constraints (PC) into the TD-MPC framework. TD-GRPC imposes a trust-region constraint in the latent policy space and introduces a physics-informed, multi-candidate trajectory ranking mechanism, all without modifying the underlying planner. The method synergistically combines temporal-difference learning, model predictive control, GRPO, latent-space trust-region regularization, and physics-guided relative ranking. Experiments demonstrate substantial improvements in stability across tasks ranging from basic locomotion to highly dynamic maneuvers, in policy robustness against disturbances, and in sample efficiency during training.
📝 Abstract
Robot learning in high-dimensional control settings, such as humanoid locomotion, presents persistent challenges for reinforcement learning (RL) algorithms due to unstable dynamics, complex contact interactions, and sensitivity to distributional shifts during training. Model-based methods, e.g., Temporal-Difference Model Predictive Control (TD-MPC), have demonstrated promising results by combining short-horizon planning with value-based learning, enabling efficient solutions for basic locomotion tasks. However, these approaches remain ineffective in addressing the policy mismatch and instability introduced by off-policy updates. In this work, we introduce Temporal-Difference Group Relative Policy Constraint (TD-GRPC), an extension of the TD-MPC framework that unifies Group Relative Policy Optimization (GRPO) with explicit Policy Constraints (PC). TD-GRPC applies a trust-region constraint in the latent policy space to maintain consistency between the planning priors and learned rollouts, while leveraging group-relative ranking to assess and preserve the physical feasibility of candidate trajectories. Unlike prior methods, TD-GRPC achieves robust motions without modifying the underlying planner, enabling flexible planning and policy learning. We validate our method across a locomotion task suite ranging from basic walking to highly dynamic movements on the 26-DoF Unitree H1-2 humanoid robot. Simulation results demonstrate that TD-GRPC improves stability, policy robustness, and sample efficiency when training for complex humanoid control tasks.
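The group-relative ranking idea described above can be sketched concretely. The following is a minimal illustration, not the paper's implementation: it assumes a group of candidate trajectories sampled per state, a hypothetical physics-informed scoring function supplied by the caller, and GRPO-style advantages computed by normalizing each candidate's score against its own group's statistics (all function and variable names here are illustrative).

```python
import math

def group_relative_advantages(scores):
    """GRPO-style advantage: normalize each candidate's score against
    the mean and standard deviation of its own sampling group."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) + 1e-8  # epsilon guards against a zero-variance group
    return [(s - mean) / std for s in scores]

def rank_candidates(trajectories, score_fn):
    """Rank a group of candidate trajectories by a caller-supplied
    physics-informed score (e.g., penalizing infeasible contacts),
    best first, returning the ordering and the relative advantages."""
    scores = [score_fn(t) for t in trajectories]
    advs = group_relative_advantages(scores)
    order = sorted(range(len(trajectories)), key=lambda i: advs[i], reverse=True)
    return [trajectories[i] for i in order], advs
```

Because advantages are relative within each group, no learned value baseline is needed: a candidate is reinforced only insofar as it outperforms its peers sampled from the same state.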