One Policy but Many Worlds: A Scalable Unified Policy for Versatile Humanoid Locomotion

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Humanoid locomotion suffers from poor cross-terrain generalization, heavy reliance on hand-crafted task rewards, and limited scalability to novel environments. To address these challenges, we propose DreamPolicy—the first unified humanoid control framework enabling zero-shot cross-terrain generalization. Its core innovation is Humanoid Motion Imagery (HMI): a terrain-aware diffusion planner that generates physically feasible “dream” trajectories, eliminating explicit reward engineering and task-specific training paradigms. DreamPolicy integrates offline reinforcement learning, autoregressive terrain-conditioned diffusion modeling, an HMI-conditioned policy network, and multi-expert rollout aggregation. On standard benchmarks, DreamPolicy achieves a 90% average success rate on seen terrains and outperforms state-of-the-art methods by 20% on unseen terrains. Moreover, it demonstrates strong robustness against dynamic disturbances and complex composite scenarios.

📝 Abstract
Humanoid locomotion faces a critical scalability challenge: traditional reinforcement learning (RL) methods require task-specific rewards and struggle to leverage growing datasets, even as more training terrains are introduced. We propose DreamPolicy, a unified framework that enables a single policy to master diverse terrains and generalize zero-shot to unseen scenarios by systematically integrating offline data and diffusion-driven motion synthesis. At its core, DreamPolicy introduces Humanoid Motion Imagery (HMI): future state predictions synthesized by an autoregressive terrain-aware diffusion planner, trained on data curated by aggregating rollouts from specialized policies across distinct terrains. Unlike human motion datasets that require laborious retargeting, our data directly captures humanoid kinematics, enabling the diffusion planner to synthesize "dreamed" trajectories that encode terrain-specific physical constraints. These trajectories act as dynamic objectives for our HMI-conditioned policy, bypassing manual reward engineering and enabling cross-terrain generalization. DreamPolicy addresses the scalability limitations of prior methods: while traditional RL fails to exploit growing datasets, our framework scales seamlessly with more offline data. As the dataset expands, the diffusion prior learns richer locomotion skills, which the policy leverages to master new terrains without retraining. Experiments demonstrate that DreamPolicy achieves an average 90% success rate in training environments and, on average, 20% higher success on unseen terrains than prevailing methods. It also generalizes to perturbed and composite scenarios where prior approaches collapse. By unifying offline data, diffusion-based trajectory synthesis, and policy optimization, DreamPolicy overcomes the "one task, one policy" bottleneck, establishing a paradigm for scalable, data-driven humanoid control.
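The control loop the abstract describes — a terrain-aware planner "dreams" a short trajectory of future states, and the policy tracks it instead of optimizing a hand-crafted reward — can be illustrated with a minimal, toy sketch. Everything below is an assumption for illustration: `dream_trajectory` stands in for the paper's autoregressive diffusion planner (real denoising replaced by a single noisy step toward a terrain-dependent prior), and `hmi_conditioned_action` replaces the learned HMI-conditioned policy network with naive proportional tracking.

```python
import numpy as np


def terrain_offset(terrain):
    # Hypothetical per-terrain forward-progress priors (2-D toy state:
    # forward displacement, vertical displacement).
    return {"flat": np.array([0.10, 0.00]),
            "stairs": np.array([0.05, 0.05])}[terrain]


def dream_trajectory(state, terrain, horizon=8, rng=None):
    """Toy stand-in for the terrain-aware diffusion planner: autoregressively
    roll out 'dreamed' future states, each nudged toward a terrain prior with
    a small noisy perturbation in place of a real denoising process."""
    rng = rng or np.random.default_rng(0)
    s = state.copy()
    traj = []
    for _ in range(horizon):
        target = s + terrain_offset(terrain)          # terrain-conditioned drift
        noisy = target + 0.1 * rng.standard_normal(s.shape)
        s = 0.9 * target + 0.1 * noisy                # crude one-step "denoise"
        traj.append(s)
    return np.stack(traj)                             # shape: (horizon, state_dim)


def hmi_conditioned_action(state, hmi):
    """The policy treats the first dreamed state as a dynamic tracking
    objective, replacing manual reward engineering with trajectory tracking."""
    return hmi[0] - state


state = np.zeros(2)
hmi = dream_trajectory(state, "stairs")
action = hmi_conditioned_action(state, hmi)
```

The key design point this sketch mirrors is the interface: the planner's output (the HMI trajectory) is the only terrain-specific signal the policy consumes, which is what lets new terrains be handled by retraining the planner's data prior rather than the policy itself.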
Problem

Research questions and friction points this paper is trying to address.

Scalable unified policy for diverse humanoid locomotion tasks
Zero-shot generalization to unseen terrains and scenarios
Overcoming one-task-one-policy bottleneck with data-driven control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified policy framework for diverse terrains
Diffusion-driven motion synthesis for generalization
Offline data integration for scalable learning