🤖 AI Summary
Existing vision-language-action policies face a fundamental trade-off: conditioning on generated reasoning enables strong language controllability but incurs high inference latency, whereas eliminating reasoning restores responsiveness at the cost of language steerability and semantic fidelity. This work proposes MoTVLA, a Mixture-of-Transformers (MoT)-based model that unifies fast and slow reasoning within a single vision-language-action architecture. The fast path employs a lightweight domain-expert network for rapid, domain-specific reasoning such as robot motion decomposition; the slow path leverages a pretrained vision-language model for perception, scene understanding, and semantic planning. Crucially, the action expert is conditioned on the decomposed motion instructions, coordinating both pathways through joint behavior policy learning. Evaluated across NLP benchmarks, robotic simulation, and real-world deployment, MoTVLA reduces end-to-end latency while significantly improving instruction-following accuracy and cross-task generalization, pointing toward efficient, language-steerable embodied intelligence in open-world settings.
📝 Abstract
Integrating visual-language instructions into visuomotor policies is gaining momentum in robot learning as a way to enhance open-world generalization. Despite promising advances, existing approaches face two challenges: limited language steerability when no generated reasoning is used as a condition, or significant inference latency when reasoning is incorporated. In this work, we introduce MoTVLA, a mixture-of-transformers (MoT)-based vision-language-action (VLA) model that integrates fast-slow unified reasoning with behavior policy learning. MoTVLA preserves the general intelligence of pretrained VLMs (serving as the generalist) for tasks such as perception, scene understanding, and semantic planning, while incorporating a domain expert, a second transformer that shares knowledge with the pretrained VLM, to generate domain-specific fast reasoning (e.g., robot motion decomposition), thereby improving policy execution efficiency. By conditioning the action expert on decomposed motion instructions, MoTVLA can learn diverse behaviors and substantially improve language steerability. Extensive evaluations across natural language processing benchmarks, robotic simulation environments, and real-world experiments confirm the superiority of MoTVLA in both fast-slow reasoning and manipulation task performance.
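The fast-slow structure described above can be sketched in code. The following is a minimal illustrative toy, not the authors' implementation: all module names, sizes, and the way the two paths interact (a pooled shared encoding, a slow "plan" feature occasionally refreshed, a fast "motion code" consumed by the action expert) are assumptions chosen only to make the dataflow concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # hidden size (illustrative only)

# Shared token embeddings: both reasoning paths read the same instruction encoding,
# standing in for the knowledge shared between the VLM and the domain expert.
shared_embed = rng.normal(size=(100, D))

def encode(token_ids):
    # Crude pooled instruction encoding (placeholder for the VLM tokenizer/encoder).
    return shared_embed[token_ids].mean(axis=0)

# Slow path: stand-in for the pretrained VLM generalist (semantic planning).
W_slow = rng.normal(size=(D, D))
def slow_generalist(x):
    return np.tanh(x @ W_slow)

# Fast path: lightweight domain expert emitting a decomposed-motion code.
W_fast = rng.normal(size=(D, D))
def fast_expert(x):
    return np.tanh(x @ W_fast)

# Action expert: conditioned on observation features AND the fast path's motion code.
W_act = rng.normal(size=(2 * D, 7))  # 7-DoF action head, hypothetical
def action_expert(obs_feat, motion_code):
    return np.concatenate([obs_feat, motion_code]) @ W_act

tokens = np.array([3, 14, 15])           # toy instruction token ids
ctx = encode(tokens)
plan = slow_generalist(ctx)              # slow reasoning: run infrequently
motion = fast_expert(ctx + 0.1 * plan)   # fast reasoning, informed by the plan
obs = rng.normal(size=D)                 # toy observation features
action = action_expert(obs, motion)
print(action.shape)
```

In this sketch the slow generalist could be invoked once per subtask while the fast expert and action expert run every control step, which is the latency benefit the abstract attributes to separating the two reasoning paths.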