TD-GRPC: Temporal Difference Learning with Group Relative Policy Constraint for Humanoid Locomotion

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
In high-dimensional humanoid locomotion control (e.g., the 26-DoF Unitree H1-2), reinforcement learning faces dynamical instability, complex contact interactions, and the policy mismatch and convergence difficulties that arise from distributional shift in off-policy training. To address these, the authors propose TD-GRPC: the first integration of Group Relative Policy Optimization (GRPO) and explicit Policy Constraints (PC) into the TD-MPC framework. TD-GRPC imposes a trust-region constraint in the latent policy space and introduces a physics-informed, multi-candidate trajectory ranking mechanism, all without modifying the underlying planner. The method thereby combines temporal-difference learning, model predictive control, GRPO, latent-space trust-region regularization, and physics-guided relative ranking. Experiments demonstrate substantial improvements in stability across tasks ranging from basic locomotion to highly dynamic maneuvers, in policy robustness against disturbances, and in sample efficiency during training.

📝 Abstract
Robot learning in high-dimensional control settings, such as humanoid locomotion, presents persistent challenges for reinforcement learning (RL) algorithms due to unstable dynamics, complex contact interactions, and sensitivity to distributional shifts during training. Model-based methods, e.g., Temporal-Difference Model Predictive Control (TD-MPC), have demonstrated promising results by combining short-horizon planning with value-based learning, enabling efficient solutions for basic locomotion tasks. However, these approaches remain ineffective in addressing the policy mismatch and instability introduced by off-policy updates. Thus, in this work, we introduce Temporal-Difference Group Relative Policy Constraint (TD-GRPC), an extension of the TD-MPC framework that unifies Group Relative Policy Optimization (GRPO) with explicit Policy Constraints (PC). TD-GRPC applies a trust-region constraint in the latent policy space to maintain consistency between the planning priors and learned rollouts, while leveraging group-relative ranking to assess and preserve the physical feasibility of candidate trajectories. Unlike prior methods, TD-GRPC achieves robust motions without modifying the underlying planner, enabling flexible planning and policy learning. We validate our method across a locomotion task suite ranging from basic walking to highly dynamic movements on the 26-DoF Unitree H1-2 humanoid robot. In simulation, TD-GRPC demonstrates improved stability, policy robustness, and sampling efficiency when training complex humanoid control tasks.
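The abstract describes two ingredients: group-relative ranking of candidate trajectories (GRPO-style) and a trust-region constraint in the latent policy space. The sketch below illustrates both in isolation, not the authors' implementation: a standard group-relative advantage normalization, and a closed-form KL divergence between diagonal-Gaussian latent policies that could serve as the trust-region regularizer. All function names and the diagonal-Gaussian assumption are illustrative.

```python
import numpy as np

def group_relative_advantages(returns):
    """GRPO-style relative ranking: score each candidate's return
    against the group's mean and standard deviation, so advantages
    are comparable across groups without a learned value baseline."""
    returns = np.asarray(returns, dtype=float)
    return (returns - returns.mean()) / (returns.std() + 1e-8)

def latent_trust_region_penalty(mu_new, mu_old, log_std_new, log_std_old):
    """KL(new || old) between two diagonal-Gaussian latent policies.
    Penalizing this keeps learned rollouts close to the planning
    prior, analogous to the latent-space trust-region constraint
    described in the abstract (hypothetical sketch)."""
    var_new = np.exp(2.0 * np.asarray(log_std_new, dtype=float))
    var_old = np.exp(2.0 * np.asarray(log_std_old, dtype=float))
    diff = np.asarray(mu_new, dtype=float) - np.asarray(mu_old, dtype=float)
    # Closed-form KL for diagonal Gaussians, summed over latent dims.
    return 0.5 * np.sum(
        (var_new + diff ** 2) / var_old
        - 1.0
        + 2.0 * (np.asarray(log_std_old, float) - np.asarray(log_std_new, float))
    )
```

Identical distributions yield a zero penalty, and the advantages of a group always sum to (numerically) zero, which is what makes the ranking purely relative.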
Problem

Research questions and friction points this paper is trying to address.

Addresses policy mismatch and instability in humanoid locomotion RL
Combines trust-region constraints with group-relative trajectory ranking
Enhances stability and robustness in high-dimensional robot control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines TD-MPC with GRPO and Policy Constraints
Applies trust-region constraint in latent policy space
Uses group-relative ranking for feasible trajectories
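The third innovation bullet, group-relative ranking of physically feasible trajectories, can be sketched as follows. This is a minimal illustration, assuming the feasibility of each candidate is given as a boolean flag (in practice it would come from physics checks such as contact or torque limits); infeasible candidates are masked out before the relative ranking selects the top-k.

```python
import numpy as np

def rank_feasible_candidates(returns, feasible, k=1):
    """Return the indices of the top-k candidate trajectories,
    ranked by group-relative return with physically infeasible
    candidates excluded (hypothetical sketch, not the paper's code)."""
    returns = np.asarray(returns, dtype=float)
    # Group-relative score: normalize returns within the candidate group.
    scores = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Mask infeasible candidates so they can never be selected.
    scores = np.where(np.asarray(feasible, dtype=bool), scores, -np.inf)
    # Sort descending and keep the k best feasible candidates.
    return np.argsort(scores)[::-1][:k]
```

For example, if the highest-return candidate violates a feasibility check, the ranking falls back to the best feasible one, which is how the mechanism can preserve physical plausibility without touching the planner itself.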