ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing humanoid robots are limited in autonomous whole-body loco-manipulation due to low-fidelity motion retargeting, restricted skill repertoires, and reliance on predefined action sequences. This work proposes ULTRA, a unified framework that, for the first time, directly generates robust whole-body behaviors from sparse task instructions and egocentric visual inputs without requiring reference motions at test time. ULTRA integrates physics-driven neural retargeting, a multimodal reinforcement learning controller, and latent-space compression of locomotion and manipulation skills. Generalization is further enhanced through universal tracking policy distillation and out-of-distribution fine-tuning. Experiments in simulation and on the Unitree G1 physical robot demonstrate that ULTRA significantly outperforms pure motion-tracking baselines, achieving superior autonomy, generalization, and robustness.
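The summary mentions latent-space compression of locomotion and manipulation skills. As a rough illustration of that idea (not the paper's actual architecture), the sketch below compresses short retargeted motion snippets into a compact latent code with a plain autoencoder; all class names, layer sizes, and dimensions (29 joints, 16-frame snippets, 32-D latent) are assumptions chosen only for the example.

```python
import torch
import torch.nn as nn

class SkillAutoencoder(nn.Module):
    """Compress short whole-body motion snippets into a compact latent skill code.

    Hypothetical setup: 29 actuated joints on a G1-class humanoid, 16-frame
    snippets, a 32-D latent. These numbers are placeholders, not taken from ULTRA.
    """

    def __init__(self, num_joints=29, horizon=16, latent_dim=32):
        super().__init__()
        in_dim = num_joints * horizon
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ELU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ELU(),
            nn.Linear(256, in_dim),
        )
        self.num_joints = num_joints
        self.horizon = horizon

    def forward(self, motion):
        # motion: (batch, horizon, num_joints) joint-position targets
        flat = motion.flatten(start_dim=1)
        z = self.encoder(flat)
        recon = self.decoder(z).view(-1, self.horizon, self.num_joints)
        return recon, z


# Toy reconstruction step on random data, just to show the objective.
model = SkillAutoencoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
motion = torch.randn(64, 16, 29)  # stand-in for retargeted motion snippets
recon, z = model(motion)
loss = nn.functional.mse_loss(recon, motion)
optim.zero_grad()
loss.backward()
optim.step()
```

A latent of this kind can then serve as the action or conditioning space for a downstream controller, which is one common way such compression is used in humanoid skill learning.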

📝 Abstract
Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.
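The abstract describes distilling a universal tracking policy into a controller that accepts both dense motion references and sparse task specifications. A minimal sketch of that pattern is shown below, assuming DAgger/behavior-cloning-style regression onto a frozen teacher and simple modality dropout to handle missing references; every name, dimension, and the masking scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultimodalStudent(nn.Module):
    """Student controller conditioned on proprioception plus either a dense
    motion reference or a sparse task/goal embedding.

    Assumption: a dropped modality is zeroed and flagged by a binary mask;
    the real system may instead use learned null tokens, vision encoders, etc.
    """

    def __init__(self, proprio_dim=48, ref_dim=64, goal_dim=16, action_dim=29):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(proprio_dim + ref_dim + goal_dim + 2, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, proprio, reference, goal, ref_mask, goal_mask):
        # Masks are (batch, 1) floats in {0, 1}; a masked-out modality is zeroed.
        x = torch.cat([proprio,
                       reference * ref_mask,
                       goal * goal_mask,
                       ref_mask, goal_mask], dim=-1)
        return self.net(x)


def distillation_step(student, teacher_actions, batch, optim):
    """One behavior-cloning step: regress the frozen tracking teacher's actions,
    randomly dropping the dense reference so the student also learns to act
    from the sparse goal alone."""
    proprio, reference, goal = batch
    ref_mask = (torch.rand(proprio.shape[0], 1) > 0.5).float()
    goal_mask = torch.ones(proprio.shape[0], 1)
    pred = student(proprio, reference, goal, ref_mask, goal_mask)
    loss = nn.functional.mse_loss(pred, teacher_actions)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()


# Toy usage with random tensors, only to show the call pattern.
student = MultimodalStudent()
optim = torch.optim.Adam(student.parameters(), lr=3e-4)
batch = (torch.randn(32, 48), torch.randn(32, 64), torch.randn(32, 16))
teacher_actions = torch.randn(32, 29)  # stand-in for the tracking teacher's output
distillation_step(student, teacher_actions, batch, optim)
```

After distillation, the abstract indicates that RL fine-tuning is applied to the student to improve robustness on out-of-distribution scenarios; the sketch above covers only the supervised distillation stage.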
Problem

Research questions and friction points this paper is trying to address.

whole-body loco-manipulation
autonomous humanoid
motion retargeting
task specification
multimodal control
Innovation

Methods, ideas, or system contributions that make the work stand out.

neural retargeting
unified multimodal control
whole-body loco-manipulation
reinforcement learning finetuning
egocentric perception
Xialin He
University of Illinois Urbana-Champaign
Sirui Xu
University of Illinois at Urbana-Champaign
Computer Vision, Machine Learning, Virtual Humans, Character Animation, Human-Object Interaction
Xinyao Li
University of Electronic Science and Technology of China
Runpei Dong
PhD Student, University of Illinois Urbana-Champaign
Robot Learning, Reinforcement Learning, Machine Learning
Liuyu Bian
University of Illinois Urbana-Champaign
Yu-Xiong Wang
University of Illinois Urbana-Champaign
Liang-Yan Gui
University of Illinois Urbana-Champaign