HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current robot policy models struggle with cross-platform heterogeneity—including divergent robot morphologies, action spaces, sensor configurations, and control frequencies—resulting in poor generalization and limited transferability across hardware platforms. To address this, we propose the Hierarchical Mixture of Experts (HiMoE) architecture, which employs hierarchical adaptive modeling to progressively decouple heterogeneous factors and learn shared multimodal representations, enabling joint vision-language-action policy learning with explicit sensorimotor alignment across varying actuation frequencies. Trained on large-scale heterogeneous robot data, HiMoE is the first model to achieve unified vision-language-action policy modeling across morphologically diverse physical robots. Experiments demonstrate substantial improvements over state-of-the-art vision-language-action (VLA) methods in both simulation and real-robot settings, with significant gains in task accuracy, strong cross-device robustness, and zero-shot transfer capability to unseen robotic platforms.

📝 Abstract
The development of foundation models for embodied intelligence critically depends on access to large-scale, high-quality robot demonstration data. Recent approaches have sought to address this challenge by training on large collections of heterogeneous robotic datasets. However, unlike vision or language data, robotic demonstrations exhibit substantial heterogeneity across embodiments and action spaces, as well as other prominent variations such as sensor configurations and action control frequencies. The lack of explicit designs for handling such heterogeneity causes existing methods to struggle with integrating diverse factors, thereby limiting their generalization and leading to degraded performance when transferred to new settings. In this paper, we present HiMoE-VLA, a novel vision-language-action (VLA) framework tailored to effectively handle diverse robotic data with heterogeneity. Specifically, we introduce a Hierarchical Mixture-of-Experts (HiMoE) architecture for the action module, which adaptively handles multiple sources of heterogeneity across layers and gradually abstracts them into shared knowledge representations. Through extensive experimentation with simulation benchmarks and real-world robotic platforms, HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and robust generalization across diverse robots and action spaces. The code and models are publicly available at https://github.com/ZhiyingDu/HiMoE-VLA.
Problem

Research questions and friction points this paper is trying to address.

How to handle heterogeneity in robotic demonstration data
How to integrate diverse factors for better generalization
How to adapt to varied embodiments and action spaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Mixture-of-Experts architecture adaptively handles heterogeneity
Gradually abstracts diverse factors into shared knowledge representations
Achieves robust generalization across diverse robots and action spaces
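The paper describes the HiMoE action module only at a high level (expert layers that adaptively absorb heterogeneous factors early and converge on shared representations in later layers). As a rough illustration of that idea, here is a minimal, hypothetical sketch of a hierarchical mixture-of-experts stack in NumPy; the class names, expert schedule `(8, 4, 1)`, and soft gating are all assumptions for illustration, not the authors' actual design.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    """One mixture-of-experts layer: a learned gate softly routes each
    input across several expert projections and mixes their outputs."""
    def __init__(self, dim, num_experts, rng):
        self.gate = rng.normal(0, 0.02, size=(dim, num_experts))
        self.experts = [rng.normal(0, 0.02, size=(dim, dim))
                        for _ in range(num_experts)]

    def __call__(self, x):
        weights = softmax(x @ self.gate)                        # (batch, E)
        outs = np.stack([x @ W for W in self.experts], axis=1)  # (batch, E, dim)
        return (weights[..., None] * outs).sum(axis=1)          # (batch, dim)

class HierarchicalMoE:
    """Stack of MoE layers with progressively fewer experts, so that
    embodiment-specific variation can be handled by many experts early
    and later layers abstract toward shared representations
    (a hypothetical schedule, not the paper's)."""
    def __init__(self, dim, experts_per_layer=(8, 4, 1), seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [MoELayer(dim, e, rng) for e in experts_per_layer]

    def __call__(self, x):
        for layer in self.layers:
            x = x + layer(x)  # residual connection around each MoE layer
        return x

policy = HierarchicalMoE(dim=16)
obs = np.random.default_rng(1).normal(size=(2, 16))
act = policy(obs)  # act.shape == (2, 16)
```

The final single-expert layer plays the role of the fully shared stage: its gate weights collapse to 1, so all inputs pass through the same projection regardless of embodiment.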
Zhiying Du
Fudan University
Bei Liu
Microsoft Research Asia
Yaobo Liang
microsoft.com
Embodied AI · Natural Language Processing · AI Agent
Yichao Shen
Xi’an Jiaotong University
Haidong Cao
Fudan University
Xiangyu Zheng
Fudan University
Zhiyuan Feng
Tsinghua University
Zuxuan Wu
Fudan University
Jiaolong Yang
Microsoft Research
3D Computer Vision
Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video Analysis · Embodied AI · Trustworthy AI