🤖 AI Summary
Humanoid robots must simultaneously achieve precise navigation command tracking and compliant interaction with external forces; however, existing reinforcement learning (RL) approaches prioritize robustness, yielding overly rigid responses and insufficient compliance. This paper proposes a preference-conditioned multi-objective RL framework whose central novelty is dynamic preference modulation, unifying rigid trajectory tracking and compliant behavior within a single policy. We explicitly model external force effects via a velocity-resistance factor and employ an encoder-decoder architecture to extract privileged features from lightweight observations, enabling end-to-end omnidirectional walking control. Evaluated in simulation and on a physical humanoid platform, our method significantly improves policy adaptability and training convergence speed, supports real-time behavioral preference switching, and achieves high-performance, deployable bipedal locomotion.
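The preference-conditioned objective can be read as a scalarization over the two competing rewards. The sketch below illustrates that reading under stated assumptions: the exponential reward shapes, the `k_res` resistance gain, and the way the preference vector is appended to the observation are illustrative choices, not the paper's actual formulation.

```python
import numpy as np

def sample_preference(rng):
    """Sample a preference vector on the 2-simplex (tracking vs. compliance).

    Resampling per episode exposes the policy to the full trade-off,
    which is what allows preference switching at deployment time.
    """
    w_track = rng.uniform(0.0, 1.0)
    return np.array([w_track, 1.0 - w_track])

def scalarized_reward(w, v_cmd, v_base, f_ext, k_res=0.1):
    """Blend rigid tracking and compliant tracking under preference w.

    A velocity-resistance factor (k_res, an assumed gain) shifts the
    velocity target along a sustained external force, so yielding to a
    push is rewarded instead of penalized.
    """
    v_target = v_cmd + k_res * f_ext                      # compliant target
    r_track = np.exp(-np.sum((v_base - v_cmd) ** 2))      # rigid tracking
    r_comply = np.exp(-np.sum((v_base - v_target) ** 2))  # compliant tracking
    return w[0] * r_track + w[1] * r_comply

rng = np.random.default_rng(0)
w = sample_preference(rng)                # also conditions the policy:
obs = np.concatenate([np.zeros(3), w])    # preference appended to the obs
r = scalarized_reward(w, v_cmd=np.array([0.5, 0.0]),
                      v_base=np.array([0.4, 0.1]),
                      f_ext=np.array([1.0, 0.0]))
```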
📝 Abstract
Humanoid locomotion requires not only accurate command tracking for navigation but also compliant responses to external forces during human interaction. Despite significant progress, existing RL approaches mainly emphasize robustness, yielding policies that resist external forces but lack compliance, a shortcoming that is particularly challenging for inherently unstable humanoids. In this work, we address this by formulating humanoid locomotion as a multi-objective optimization problem that balances command tracking and external force compliance. We introduce a preference-conditioned multi-objective RL (MORL) framework that integrates rigid command following and compliant behaviors within a single omnidirectional locomotion policy. External forces are modeled via a velocity-resistance factor for consistent reward design, and training leverages an encoder-decoder structure that infers task-relevant privileged features from deployable observations. We validate our approach in both simulation and real-world experiments on a humanoid robot. Experimental results indicate that our framework not only improves adaptability and convergence over standard pipelines, but also realizes deployable preference-conditioned humanoid locomotion.
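As a companion sketch for the encoder-decoder training signal described above: a minimal PyTorch module that compresses a window of deployable observations into a latent and regresses simulation-only privileged features from it. All layer sizes and dimensions below are assumptions for illustration; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class PrivilegedEstimator(nn.Module):
    """Encoder-decoder that infers privileged features from deployable obs.

    The encoder compresses a history of onboard observations into a
    latent; the decoder reconstructs privileged, simulation-only signals
    (e.g. external force, velocity-resistance factor) as a training
    target. At deployment only the encoder runs, and its latent is fed
    to the policy alongside the preference vector.
    """
    def __init__(self, obs_dim, history, latent_dim, priv_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim * history, 128), nn.ELU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ELU(),
            nn.Linear(128, priv_dim),
        )

    def forward(self, obs_history):
        z = self.encoder(obs_history)
        return z, self.decoder(z)

# All dimensions below are illustrative, not taken from the paper.
model = PrivilegedEstimator(obs_dim=45, history=5, latent_dim=16, priv_dim=4)
obs_hist = torch.randn(64, 45 * 5)    # batch of observation windows
priv_true = torch.randn(64, 4)        # privileged labels from the simulator
z, priv_pred = model(obs_hist)
recon_loss = nn.functional.mse_loss(priv_pred, priv_true)  # auxiliary loss
```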