HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models struggle to coordinate whole-body motion on high-degree-of-freedom humanoid robots: they control body parts largely independently, which often leads to instability. This work proposes HEX, a framework that integrates vision-language instructions with proprioceptive dynamics through four components: a humanoid-aligned universal state representation, a Mixture-of-Experts Unified Proprioceptive Predictor, lightweight history tokens, and a residual-gated fusion mechanism. A flow-matching action head then generates coherent whole-body motions from the fused representation. According to the authors, the approach improves whole-body coordination, fast reaction, and cross-embodiment generalization, and achieves state-of-the-art success rates on real-world humanoid manipulation tasks.
📝 Abstract
Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.
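To make the flow-matching action head more concrete, here is a minimal sketch of how sampling from such a head typically works: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1. The abstract does not give HEX's network, conditioning, or solver, so everything here is an assumption: `euler_flow_sample`, `make_toy_velocity`, the Euler solver, and the step count are hypothetical, and the toy velocity field stands in for a learned model conditioned on the fused features.

```python
import random


def euler_flow_sample(velocity, x0, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1.0
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity
    x1 - x0 can be written as (x1 - x_t) / (1 - t). A real action head
    would predict this from observations instead of knowing the target.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
target_action = [0.3, -0.1, 0.8]  # illustrative 3-DoF action, not from the paper
noise = [rng.gauss(0.0, 1.0) for _ in target_action]
action = euler_flow_sample(make_toy_velocity(target_action), noise, num_steps=10)
```

With this toy field the integration moves the noise sample along a straight line onto the target action; in a trained model the same loop would turn noise into a whole-body action chunk.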
Problem

Research questions and friction points this paper is trying to address.

whole-body manipulation
humanoid robots
cross-embodiment
coordinated control
high-DoF
Innovation

Methods, ideas, or system contributions that make the work stand out.

Humanoid-Aligned State Representation
Mixture-of-Experts
Whole-Body Coordination
Temporal Visual Context
Residual-Gated Fusion
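Two of the ideas listed above can be illustrated with a small sketch. The listing does not specify how HEX lays out its aligned state or how the gate is computed, so the following is an assumption-laden illustration only: the slot names and widths in `CANONICAL_SLOTS`, the zero-padding with a validity mask, and the per-dimension sigmoid gate are all hypothetical choices, not the authors' design.

```python
import math

# Hypothetical canonical layout: body part -> number of state dimensions.
CANONICAL_SLOTS = {
    "left_arm": 7,
    "right_arm": 7,
    "waist": 3,
    "left_leg": 6,
    "right_leg": 6,
}


def align_state(part_states):
    """Map one robot's per-part state into the shared layout plus a validity mask."""
    state, mask = [], []
    for part, width in CANONICAL_SLOTS.items():
        values = list(part_states.get(part, []))
        if len(values) > width:
            raise ValueError(f"{part} has more than {width} dimensions")
        pad = width - len(values)
        state.extend(values + [0.0] * pad)       # zero-pad missing dimensions
        mask.extend([1] * len(values) + [0] * pad)  # 1 = real value, 0 = padding
    return state, mask


def residual_gated_fusion(vl_feat, proprio_feat, gate_logits):
    """Add a gated proprioceptive term to the vision-language feature."""
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    return [v + g * p for v, g, p in zip(vl_feat, gates, proprio_feat)]


# A dual-arm robot with 6-DoF arms and no waist or legs (illustrative only).
state, mask = align_state({"left_arm": [0.1] * 6, "right_arm": [0.2] * 6})

# Gate logit 0 passes half of the proprioceptive feature; -50 passes almost none.
fused = residual_gated_fusion([1.0, 2.0], [4.0, 4.0], [0.0, -50.0])
```

The residual form means that when the gate closes, the fused feature falls back to the vision-language feature unchanged, which is one plausible reason to prefer it over plain concatenation.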
Authors
Shuanghao Bai
Xi'an Jiaotong University, PhD student
Vision Language Models, Domain Adaptation, Domain Generalization, Robotic Manipulation
Meng Li
Beijing University of Posts and Telecommunications
Child-Computer Interaction, Digital Heritage
Xinyuan Lv
Beijing Innovation Center of Humanoid Robotics
Jiawei Wang
Beijing Innovation Center of Humanoid Robotics
Xinhua Wang
Beijing Innovation Center of Humanoid Robotics
Fei Liao
Beijing Innovation Center of Humanoid Robotics
Chengkai Hou
Peking University
Robot
Langzhe Gu
Peking University
Wanqi Zhou
Xi'an Jiaotong University
Information theory, causal discovery, out-of-distribution generalization
Kun Wu
Beijing Innovation Center of Humanoid Robotics
Ziluo Ding
Unknown affiliation
Reinforcement Learning, Optical Flow
Zhiyuan Xu
Beijing Innovation Center of Humanoid Robotics
Lei Sun
Nankai University
Shanghang Zhang
Peking University
Embodied AI, Foundation Models
Zhengping Che
X-Humanoid
Embodied AI, Deep Learning
Jian Tang
Beijing Innovation Center of Humanoid Robotics
Badong Chen
Professor of Xi'an Jiaotong University, Xi'an, China
signal processing, machine learning, brain-machine interfaces, robotics