AI Summary
Generating diverse, physically feasible whole-body motions for humanoid robots from natural language instructions remains challenging. Method: We propose a large language-action model framework that establishes a unified human-robot motion vocabulary, integrating discrete motion tokenization, privileged policy distillation, and dynamics-aware reinforcement learning fine-tuning to enable end-to-end mapping from language to high-fidelity, dynamically stable motions. Contribution/Results: Our approach is the first to jointly design semantic motion discretization and physics-embedded policy optimization, balancing generalization and physical feasibility. Evaluated in simulation and on a real Unitree G1 robot, it achieves significant improvements over prior methods in motion naturalness, dynamic stability, and multi-step task success rate. This work establishes a scalable, language-conditioned whole-body control paradigm for general embodied intelligence.
Abstract
Enabling humanoid robots to follow free-form language commands is critical for seamless human-robot interaction, collaborative task execution, and general-purpose embodied intelligence. While recent advances have improved low-level humanoid locomotion and robot manipulation, language-conditioned whole-body control remains a significant challenge. Existing methods are often limited to simple instructions and sacrifice either motion diversity or physical plausibility. To address this, we introduce Humanoid-LLA, a Large Language Action Model that maps expressive language commands to physically executable whole-body actions for humanoid robots. Our approach integrates three core components: a unified motion vocabulary that aligns human and humanoid motion primitives into a shared discrete space; a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and a physics-informed fine-tuning stage using reinforcement learning with dynamics-aware rewards to enhance robustness and stability. Extensive evaluations in simulation and on a real-world Unitree G1 humanoid show that Humanoid-LLA delivers strong language generalization while maintaining high physical fidelity, outperforming existing language-conditioned controllers in motion naturalness, stability, and execution success rate.
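To make the idea of a shared discrete motion vocabulary concrete, the following is a minimal sketch of nearest-neighbor vector quantization, a common way to turn continuous motion features into tokens. It is an illustration under stated assumptions, not the tokenizer used by Humanoid-LLA, whose details the abstract does not give: the 2-D features, the three-entry `CODEBOOK`, and the helper names `tokenize` and `detokenize` are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each per-frame feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in features
    ]


def detokenize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Recover the (quantized) feature vector for each token."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy codebook: each entry stands in for one motion primitive.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical features from a human clip and a retargeted robot clip.
# Both are assumed to be embedded in the same feature space beforehand.
human_clip = [(0.1, -0.1), (0.9, 0.2), (0.1, 0.8)]
robot_clip = [(0.0, 0.1), (1.1, -0.1), (-0.1, 1.2)]

human_tokens = tokenize(human_clip, CODEBOOK)
robot_tokens = tokenize(robot_clip, CODEBOOK)
```

In this toy setup, a human clip and a robot clip that perform the same motion land on the same token sequence, which is the property a unified vocabulary is meant to provide: a language model can then predict tokens without caring which embodiment produced the training data. In a learned system the codebook and the feature encoders would be trained rather than hand-written.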