HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

📅 2026-04-09

🤖 AI Summary
Existing vision-language-action (VLA) models struggle to coordinate whole-body motion on high-degree-of-freedom humanoid robots, and controlling body parts independently often leads to instability. This work proposes HEX, a framework that integrates vision-language instructions with proprioceptive dynamics through four components: a humanoid-aligned universal state representation, a Mixture-of-Experts Unified Proprioceptive Predictor, lightweight history tokens, and a residual-gated fusion mechanism. A flow-matching action head then generates coherent whole-body motions from the fused representation. According to the authors, the approach improves whole-body coordination, reaction speed, and cross-embodiment generalization, and achieves state-of-the-art success rates on real-world humanoid manipulation tasks.
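
To make the flow-matching action head mentioned above more concrete, here is a minimal sketch of how such a head typically samples an action: start from an initial (noise) action and integrate a velocity field from t = 0 to t = 1 with Euler steps. This is not the HEX implementation. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the closed-form toy velocity field stands in for a learned network that, in a real system, would be conditioned on the fused vision-language and proprioceptive features.

```python
def euler_sample(velocity_fn, a0, n_steps=10):
    """Integrate da/dt = velocity_fn(a, t) from t=0 to t=1 with Euler steps.

    a0 is the starting action (normally Gaussian noise); the return value
    is the generated action at t=1.
    """
    dt = 1.0 / n_steps
    a = list(a0)
    for k in range(n_steps):
        t = k * dt  # t never reaches 1.0, so (1 - t) below stays positive
        v = velocity_fn(a, t)
        a = [ai + dt * vi for ai, vi in zip(a, v)]
    return a


def make_toy_velocity(target):
    """Toy stand-in for a learned velocity network (assumption, not HEX).

    For the linear path a_t = (1 - t) * a_0 + t * target, the velocity
    is target - a_0, which equals (target - a_t) / (1 - t).
    """
    def velocity(a, t):
        return [(ti - ai) / (1.0 - t) for ti, ai in zip(target, a)]
    return velocity


if __name__ == "__main__":
    target_action = [0.5, -1.0, 2.0]   # hypothetical 3-DoF action
    start = [0.0, 0.0, 0.0]            # would be random noise in practice
    print(euler_sample(make_toy_velocity(target_action), start))
```

With the toy field, the integration lands on the target action up to floating-point error regardless of the starting point, which is the behavior a well-trained velocity network approximates for the action distribution it was trained on.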

📝 Abstract
Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.
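
The abstract does not give the exact form of the residual-gated fusion, so the following is only one plausible reading, written as a minimal sketch: the vision-language feature is kept as a residual path, and a learned sigmoid gate decides per dimension how much of the proprioceptive feature to add. The function name `residual_gated_fusion` and the parameters `gate_weights` and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def residual_gated_fusion(vl_feat, proprio_feat, gate_weights, gate_bias):
    """Assumed form: fused = vl_feat + g * proprio_feat, with g in (0, 1).

    The gate g is computed per output dimension from the concatenation
    of both feature vectors: g_i = sigmoid(w_i . [vl; proprio] + b_i).
    """
    d = len(vl_feat)
    assert len(proprio_feat) == d
    assert len(gate_weights) == d and len(gate_bias) == d
    concat = list(vl_feat) + list(proprio_feat)
    fused = []
    for i in range(d):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(logit)
        # Residual path: the vision-language feature always passes through.
        fused.append(vl_feat[i] + g * proprio_feat[i])
    return fused


if __name__ == "__main__":
    vl = [1.0, 2.0]
    proprio = [2.0, -4.0]
    zero_w = [[0.0] * 4 for _ in range(2)]
    # Zero weights and bias give a gate of 0.5 in every dimension.
    print(residual_gated_fusion(vl, proprio, zero_w, [0.0, 0.0]))
```

Under this reading, a gate close to zero leaves the vision-language feature untouched, while a gate close to one adds the full proprioceptive signal, which is one way to "adaptively integrate" the two streams.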
Problem

Research questions and friction points this paper is trying to address.

whole-body manipulation
humanoid robots
cross-embodiment
coordinated control
high-DoF
Innovation

Methods, ideas, or system contributions that make the work stand out.

Humanoid-Aligned State Representation
Mixture-of-Experts
Whole-Body Coordination
Temporal Visual Context
Residual-Gated Fusion
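
As an illustration of the Mixture-of-Experts idea listed above, the sketch below shows generic top-k expert routing: a router scores the experts, the k highest-scoring experts each predict the next state, and their predictions are combined with renormalized weights. This is a textbook MoE pattern, not the paper's Unified Proprioceptive Predictor. The name `moe_predict` is hypothetical, and the router logits are passed in directly rather than produced by a learned router.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_predict(state, experts, router_logits, top_k=2):
    """Combine the predictions of the top_k experts chosen by the router.

    experts is a list of callables mapping a state vector to a predicted
    state vector of the same length; router_logits has one score per expert.
    """
    assert len(experts) == len(router_logits)
    probs = softmax(router_logits)
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over chosen experts
    out = [0.0] * len(state)
    for i in chosen:
        weight = probs[i] / norm
        pred = experts[i](state)
        out = [o + weight * p for o, p in zip(out, pred)]
    return out


if __name__ == "__main__":
    # Three toy experts (assumptions for illustration only).
    experts = [
        lambda s: [x + 1.0 for x in s],
        lambda s: [x * 2.0 for x in s],
        lambda s: [0.0 for _ in s],
    ]
    print(moe_predict([1.0, 2.0], experts, [0.0, 0.0, -100.0], top_k=2))
```

Routing only to the top-k experts keeps the per-step compute bounded while letting different experts specialize, for example on different embodiments or motion regimes.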
🔎 Similar Papers
Shuanghao Bai
Xi'an Jiaotong University, PhD student
Vision Language Models, Domain Adaptation, Domain Generalization, Robotic Manipulation
Meng Li
Beijing University of Posts and Telecommunications
Child-Computer Interaction, Digital Heritage
Xinyuan Lv
Beijing Innovation Center of Humanoid Robotics
Jiawei Wang
Beijing Innovation Center of Humanoid Robotics
Xinhua Wang
Beijing Innovation Center of Humanoid Robotics
Fei Liao
Beijing Innovation Center of Humanoid Robotics
Chengkai Hou
Peking University
Robot
Langzhe Gu
Peking University
Wanqi Zhou
Xi'an Jiaotong University
Information theory, causal discovery, out-of-distribution generalization
Kun Wu
Beijing Innovation Center of Humanoid Robotics
Ziluo Ding
Unknown affiliation
Reinforcement Learning, Optical Flow
Zhiyuan Xu
Beijing Innovation Center of Humanoid Robotics
Lei Sun
Nankai University
Shanghang Zhang
Peking University
Embodied AI, Foundation Models
Zhengping Che
X-Humanoid
Embodied AI, Deep Learning
Jian Tang
Beijing Innovation Center of Humanoid Robotics
Badong Chen
Professor of Xi'an Jiaotong University, Xi'an, China
signal processing, machine learning, brain machine interfaces, robotics