MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action policies face a fundamental trade-off: conditioning on generated reasoning enables strong language controllability but incurs high inference latency, whereas eliminating reasoning sacrifices language steerability and semantic fidelity. This work proposes MoTVLA, a mixture-of-transformers (MoT) architecture that unifies fast and slow reasoning within a single vision-language-action model. The slow path leverages a pre-trained vision-language model (the generalist) for perception, scene understanding, and semantic planning; the fast path employs a lightweight domain expert, a second transformer that shares knowledge with the VLM, to generate domain-specific fast reasoning such as robot motion decomposition. By conditioning the action expert on the decomposed motion instructions, MoTVLA coordinates both pathways for behavior policy learning. Evaluated across natural language processing benchmarks, robotic simulation, and real-world deployment, MoTVLA improves instruction-following accuracy and cross-task generalization while keeping policy execution efficient.

📝 Abstract
Integrating visual-language instructions into visuomotor policies is gaining momentum in robot learning for enhancing open-world generalization. Despite promising advances, existing approaches face two challenges: limited language steerability when no generated reasoning is used as a condition, or significant inference latency when reasoning is incorporated. In this work, we introduce MoTVLA, a mixture-of-transformers (MoT)-based vision-language-action (VLA) model that integrates fast-slow unified reasoning with behavior policy learning. MoTVLA preserves the general intelligence of pre-trained VLMs (serving as the generalist) for tasks such as perception, scene understanding, and semantic planning, while incorporating a domain expert, a second transformer that shares knowledge with the pre-trained VLM, to generate domain-specific fast reasoning (e.g., robot motion decomposition), thereby improving policy execution efficiency. By conditioning the action expert on decomposed motion instructions, MoTVLA can learn diverse behaviors and substantially improve language steerability. Extensive evaluations across natural language processing benchmarks, robotic simulation environments, and real-world experiments confirm the superiority of MoTVLA in both fast-slow reasoning and manipulation task performance.
Problem

Research questions and friction points this paper is trying to address.

Addresses limited language steerability in reasoning-free visuomotor policies
Reduces the high inference latency incurred when generated reasoning is used as a condition
Bridges fast-slow reasoning and behavior policy learning within a single model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates fast-slow reasoning with behavior policy learning
Uses mixture-of-transformers with shared knowledge architecture
Conditions action expert on decomposed motion instructions
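The fast-slow coordination described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: all names (`SlowPlanner`, `FastController`, `run_episode`, `replan_every`) are assumptions. The idea is that the slow path (the pre-trained VLM generalist) replans at a low rate, producing decomposed motion instructions, while the fast path (the domain/action expert) conditions an action on the current motion instruction at every control step.

```python
class SlowPlanner:
    """Stands in for the pre-trained VLM generalist: expensive to run,
    so it is invoked infrequently to decompose the instruction into
    motion primitives (hypothetical output format)."""
    def plan(self, instruction):
        return [f"reach({instruction})", f"grasp({instruction})", "lift"]


class FastController:
    """Stands in for the domain/action expert: cheap to run, produces
    an action conditioned on the current decomposed motion instruction."""
    def act(self, observation, motion_instruction):
        return {"obs": observation, "cond": motion_instruction}


def run_episode(instruction, observations, replan_every=4):
    """Interleave low-frequency slow planning with per-step fast control."""
    planner, controller = SlowPlanner(), FastController()
    plan, actions = [], []
    for t, obs in enumerate(observations):
        if t % replan_every == 0:
            plan = planner.plan(instruction)       # slow path: amortized
        cond = plan[min(t, len(plan) - 1)]         # current motion primitive
        actions.append(controller.act(obs, cond))  # fast path: every step
    return actions
```

The design point this sketch makes is that the expensive semantic reasoning is amortized over many control steps, while each action remains conditioned on a decomposed motion instruction, which is how language steerability is preserved without paying the slow path's latency at every step.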
👥 Authors
Wenhui Huang, Harvard University
Changhe Chen, University of Michigan (Robotics · Embodied AI · Manipulation · Autonomous Driving)
Han Qi, Harvard University
Chen Lv, Nanyang Technological University
Yilun Du, Harvard University (Artificial Intelligence · Machine Learning · Robotics · Computer Vision)
Heng Yang, Harvard University