🤖 AI Summary
This work addresses the inefficiency and instability of reinforcement learning for humanoid robots caused by high-dimensional action spaces and model mismatch. To overcome these challenges, the authors propose a framework that integrates parameterized model predictive control (MPC) with reinforcement learning. Through a cost-matching mechanism, the MPC-predicted cost directly approximates the true action-value function, enabling efficient gradient-based updates without repeatedly solving the MPC optimization online. The approach constructs a parameterized MPC based on centroidal dynamics and trains it end-to-end via gradient descent. In simulation, the method significantly outperforms hand-tuned baselines, exhibiting superior robustness, locomotion performance, and policy generalization under model mismatch and external disturbances.
📄 Abstract
In this paper, we propose a cost-matching approach for optimal humanoid locomotion within a Model Predictive Control (MPC)-based Reinforcement Learning (RL) framework. A parameterized MPC formulation with centroidal dynamics is trained to approximate the action-value function obtained from high-fidelity closed-loop data. Specifically, the MPC cost-to-go is evaluated along recorded state-action trajectories, and the parameters are updated to minimize the discrepancy between MPC-predicted values and measured returns. This formulation enables efficient gradient-based learning while avoiding the computational burden of repeatedly solving the MPC problem during training. The proposed method is validated in simulation using a commercial humanoid platform. Results demonstrate improved locomotion performance and robustness to model mismatch and external disturbances compared with manually tuned baselines.
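To make the cost-matching idea concrete, the sketch below fits a parameterized value model to measured returns along recorded state-action data, minimizing the squared discrepancy by gradient descent, as the abstract describes. This is a minimal illustration, not the paper's implementation: the quadratic model `V(x, u) = xᵀ diag(q) x + uᵀ diag(r) u` standing in for the MPC cost-to-go, and all names (`predicted_cost`, `cost_matching_step`, the synthetic data) are assumptions for illustration only; the paper's actual MPC uses centroidal dynamics.

```python
import numpy as np

def predicted_cost(q, r, X, U):
    # Hypothetical stand-in for the MPC cost-to-go: a quadratic value
    # model with learnable diagonal weights q (state) and r (action).
    return (X**2 @ q) + (U**2 @ r)

def cost_matching_step(q, r, X, U, G, lr=5e-3):
    # One gradient step on the cost-matching loss
    #   L(q, r) = mean_i (V(x_i, u_i) - G_i)^2,
    # where G_i are measured returns from recorded closed-loop data.
    err = predicted_cost(q, r, X, U) - G        # per-sample discrepancy
    grad_q = 2.0 * (X**2).T @ err / len(G)      # dL/dq
    grad_r = 2.0 * (U**2).T @ err / len(G)      # dL/dr
    return q - lr * grad_q, r - lr * grad_r

# Synthetic "recorded" trajectories and returns (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))                   # recorded states
U = rng.normal(size=(256, 2))                   # recorded actions
q_true = np.array([1.0, 0.5, 2.0, 0.1])         # ground-truth weights
r_true = np.array([0.3, 0.7])
G = predicted_cost(q_true, r_true, X, U)        # synthetic returns

# Fit the parameterized cost to the measured returns.
q, r = np.ones(4), np.ones(2)
for _ in range(2000):
    q, r = cost_matching_step(q, r, X, U, G)
loss = np.mean((predicted_cost(q, r, X, U) - G) ** 2)
```

Because the model is linear in its parameters, the loss is convex and plain gradient descent recovers the weights; the appeal noted in the abstract is that no MPC problem has to be re-solved inside this training loop, since the cost-to-go is evaluated directly on the recorded data.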