HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited capability of existing vision–language–action (VLA) models to perform explicit multimodal reasoning in long-horizon or out-of-distribution tasks, where coordinated text understanding, visual prediction, and action decision-making are required. To this end, the authors propose HALO, the first embodied multimodal chain-of-thought (EM-CoT) framework, which unifies human-like reasoning through a sequential process of task decomposition, subgoal prediction, and action generation. HALO employs a Mixture-of-Transformers architecture trained on automatically synthesized EM-CoT data under a tailored training recipe that aligns its semantic, visual, and motor experts. Evaluated on the RoboTwin benchmark, HALO outperforms the π₀ baseline by 34.1% and generalizes well to highly randomized, unseen environments.

📝 Abstract
Vision-Language-Action (VLA) models have shown strong performance in robotic manipulation, but often struggle in long-horizon or out-of-distribution scenarios due to the lack of explicit mechanisms for multimodal reasoning and for anticipating how the world will evolve under action. Recent works introduce textual chain-of-thought or visual subgoal prediction into VLA models to enable reasoning, but still fail to offer a unified, human-like framework for joint textual reasoning, visual foresight, and action prediction. To this end, we propose HALO, a unified VLA model that enables embodied multimodal chain-of-thought (EM-CoT) reasoning through a sequential process of textual task reasoning, visual subgoal prediction for fine-grained guidance, and EM-CoT-augmented action prediction. We instantiate HALO with a Mixture-of-Transformers (MoT) architecture that decouples semantic reasoning, visual foresight, and action prediction into specialized experts while allowing seamless cross-expert collaboration. To train HALO at scale, we introduce an automated pipeline that synthesizes EM-CoT training data, along with a carefully crafted training recipe. Extensive experiments demonstrate that: (1) HALO achieves superior performance in both simulated and real-world environments, surpassing the baseline policy π₀ by 34.1% on the RoboTwin benchmark; (2) all proposed components of the training recipe and EM-CoT design improve task success rate; and (3) HALO exhibits strong generalization under aggressive, unseen environmental randomization with the proposed EM-CoT reasoning.
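The sequential EM-CoT process the abstract describes — textual task reasoning, then visual subgoal prediction, then subgoal-conditioned action prediction, each stage handled by a specialized expert — can be sketched at a high level as follows. This is a minimal illustrative sketch, not the paper's implementation: the expert classes, their methods, and the string-based stand-ins for visual and motor outputs are all hypothetical placeholders for the actual MoT transformer experts.

```python
from dataclasses import dataclass

# Illustrative sketch of the EM-CoT chain: three hypothetical experts run
# sequentially, each conditioning on the previous stage's output. None of
# these names come from the paper; real experts would be transformer blocks.

@dataclass
class EMCoTOutput:
    subtask_plan: list   # textual task reasoning (ordered subtasks)
    subgoal: str         # predicted visual subgoal (symbolic stand-in here)
    action: tuple        # low-level action chunk (stand-in)

class SemanticExpert:
    def decompose(self, instruction: str) -> list:
        # Stand-in for the language expert: split a long-horizon
        # instruction into ordered subtasks.
        return [step.strip() for step in instruction.split("then")]

class VisualForesightExpert:
    def predict_subgoal(self, subtask: str, observation: str) -> str:
        # Stand-in for visual subgoal prediction: describe the expected
        # future observation instead of generating image tokens.
        return f"scene after '{subtask}' given '{observation}'"

class ActionExpert:
    def predict_action(self, subtask: str, subgoal: str) -> tuple:
        # Stand-in for EM-CoT-augmented action prediction: the action is
        # conditioned on both the current subtask and the visual subgoal.
        return ("move_arm", subtask, subgoal)

def em_cot_step(instruction: str, observation: str) -> EMCoTOutput:
    plan = SemanticExpert().decompose(instruction)
    current = plan[0]
    subgoal = VisualForesightExpert().predict_subgoal(current, observation)
    action = ActionExpert().predict_action(current, subgoal)
    return EMCoTOutput(plan, subgoal, action)
```

The point of the sketch is the data flow: the action expert never sees the raw instruction alone, only the decomposed subtask plus the predicted subgoal, mirroring the paper's claim that fine-grained visual guidance sits between semantic reasoning and motor output.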
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
multimodal reasoning
embodied cognition
long-horizon tasks
out-of-distribution generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied Multimodal Chain-of-Thought
Vision-Language-Action Model
Mixture-of-Transformers
Visual Subgoal Prediction
Robotic Generalization
Quanxin Shou
The Hong Kong University of Science and Technology
Fangqi Zhu
The Hong Kong University of Science and Technology
Shawn Chen
Global Health Drug Discovery Institute
Microbiology, Chemical Biology, Antimicrobials
Puxin Yan
Sun Yat-sen University
Zhengyang Yan
The Hong Kong University of Science and Technology
Yikun Miao
The Hong Kong University of Science and Technology
Xiaoyi Pang
The Hong Kong University of Science and Technology
Zicong Hong
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Blockchain, ML System, Edge/Cloud Computing
Ruikai Shi
The Hong Kong University of Science and Technology
Hao Huang
The Hong Kong University of Science and Technology
Jie Zhang
The Hong Kong University of Science and Technology
Song Guo
Chair Professor of CSE, HKUST
Large Language Model, Edge AI, Machine Learning Systems