HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited capability of existing vision–language–action (VLA) models to perform explicit multimodal reasoning in long-horizon or out-of-distribution tasks, where coordinated text understanding, visual prediction, and action decision-making are required. To this end, the authors propose HALO, the first embodied multimodal chain-of-thought (EM-CoT) framework, which unifies human-like reasoning through a sequential process of task decomposition, subgoal prediction, and action generation. HALO employs a Mixture-of-Transformers architecture trained on automatically synthesized EM-CoT data under a tailored training recipe that aligns its semantic, visual, and motor experts. Evaluated on the RoboTwin benchmark, HALO outperforms the π₀ baseline by 34.1% and generalizes well to highly randomized, unseen environments.

📝 Abstract
Vision-Language-Action (VLA) models have shown strong performance in robotic manipulation, but often struggle in long-horizon or out-of-distribution scenarios due to the lack of explicit mechanisms for multimodal reasoning and for anticipating how the world will evolve under action. Recent works introduce textual chain-of-thought or visual subgoal prediction into VLA models to enable reasoning, but still fail to offer a unified, human-like framework for joint textual reasoning, visual foresight, and action prediction. To this end, we propose HALO, a unified VLA model that enables embodied multimodal chain-of-thought (EM-CoT) reasoning through a sequential process of textual task reasoning, visual subgoal prediction for fine-grained guidance, and EM-CoT-augmented action prediction. We instantiate HALO with a Mixture-of-Transformers (MoT) architecture that decouples semantic reasoning, visual foresight, and action prediction into specialized experts while allowing seamless cross-expert collaboration. To train HALO at scale, we introduce an automated pipeline that synthesizes EM-CoT training data, along with a carefully crafted training recipe. Extensive experiments demonstrate that: (1) HALO achieves superior performance in both simulated and real-world environments, surpassing the baseline policy π₀ by 34.1% on the RoboTwin benchmark; (2) all proposed components of the training recipe and EM-CoT design improve task success rate; and (3) HALO exhibits strong generalization under aggressive, unseen environmental randomization with the proposed EM-CoT reasoning.
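The sequential EM-CoT process the abstract describes — textual task reasoning, then visual subgoal prediction, then subgoal-conditioned action prediction, each stage handled by a specialized expert — can be sketched at a high level as follows. This is a minimal illustrative sketch, not the paper's implementation: the expert classes, their methods, and the string-based stand-ins for visual and motor outputs are all hypothetical placeholders for the actual MoT transformer experts.

```python
from dataclasses import dataclass

# Illustrative sketch of the EM-CoT chain: three hypothetical experts run
# sequentially, each conditioning on the previous stage's output. None of
# these names come from the paper; real experts would be transformer blocks.

@dataclass
class EMCoTOutput:
    subtask_plan: list   # textual task reasoning (ordered subtasks)
    subgoal: str         # predicted visual subgoal (symbolic stand-in here)
    action: tuple        # low-level action chunk (stand-in)

class SemanticExpert:
    def decompose(self, instruction: str) -> list:
        # Stand-in for the language expert: split a long-horizon
        # instruction into ordered subtasks.
        return [step.strip() for step in instruction.split("then")]

class VisualForesightExpert:
    def predict_subgoal(self, subtask: str, observation: str) -> str:
        # Stand-in for visual subgoal prediction: describe the expected
        # future observation instead of generating image tokens.
        return f"scene after '{subtask}' given '{observation}'"

class ActionExpert:
    def predict_action(self, subtask: str, subgoal: str) -> tuple:
        # Stand-in for EM-CoT-augmented action prediction: the action is
        # conditioned on both the current subtask and the visual subgoal.
        return ("move_arm", subtask, subgoal)

def em_cot_step(instruction: str, observation: str) -> EMCoTOutput:
    plan = SemanticExpert().decompose(instruction)
    current = plan[0]
    subgoal = VisualForesightExpert().predict_subgoal(current, observation)
    action = ActionExpert().predict_action(current, subgoal)
    return EMCoTOutput(plan, subgoal, action)
```

The point of the sketch is the data flow: the action expert never sees the raw instruction alone, only the decomposed subtask plus the predicted subgoal, mirroring the paper's claim that fine-grained visual guidance sits between semantic reasoning and motor output.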
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
multimodal reasoning
embodied cognition
long-horizon tasks
out-of-distribution generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied Multimodal Chain-of-Thought
Vision-Language-Action Model
Mixture-of-Transformers
Visual Subgoal Prediction
Robotic Generalization
Quanxin Shou
The Hong Kong University of Science and Technology
Fangqi Zhu
The Hong Kong University of Science and Technology
Shawn Chen
Global Health Drug Discovery Institute
Microbiology, Chemical Biology, Antimicrobials
Puxin Yan
Sun Yat-sen University
Zhengyang Yan
The Hong Kong University of Science and Technology
Yikun Miao
The Hong Kong University of Science and Technology
Xiaoyi Pang
The Hong Kong University of Science and Technology
Zicong Hong
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Blockchain, ML System, Edge/Cloud Computing
Ruikai Shi
The Hong Kong University of Science and Technology
Hao Huang
The Hong Kong University of Science and Technology
Jie Zhang
The Hong Kong University of Science and Technology
Song Guo
Chair Professor of CSE, HKUST
Large Language Model, Edge AI, Machine Learning Systems