🤖 AI Summary
Vision-language-action (VLA) models lack the unified perception–understanding–action capability that humans possess, and existing VLA training paradigms suffer from two fundamental bottlenecks: spurious forgetting, where robot training overwrites visual-text alignments, and task interference, where jointly trained control and understanding objectives degrade each other. To address these challenges, we propose ChatVLA, a unified VLA framework featuring a Phased Alignment Training strategy: the model first establishes robust foundational robot-control skills, then progressively integrates multimodal sensory data. We further introduce multi-task decoupled learning and a sparse-gated Mixture-of-Experts (MoE) architecture to enable synergistic optimization of semantic understanding and motor control. Our approach achieves a 6× performance gain on the MMMU benchmark, attains 47.2% accuracy on MMStar, and significantly outperforms state-of-the-art methods, including OpenVLA, across 25 real-world robotic manipulation tasks, demonstrating substantial improvements in joint-training stability and cross-task generalization.
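The paper is prose-only here, so as a rough illustration of the sparse-gated MoE idea described above, the following is a minimal PyTorch sketch in which a learned router sends each token to one of two feed-forward experts (e.g., one biased toward understanding, one toward control). The module names, shapes, and top-1 routing rule are illustrative assumptions, not ChatVLA's actual implementation.

```python
import torch
import torch.nn as nn

class SparseGatedMoE(nn.Module):
    """Minimal sparse-gated MoE block (illustrative sketch, not ChatVLA's code):
    a router picks the top-1 of two feed-forward experts per token."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (batch, seq, d_model)
        gate_logits = self.router(x)                     # (B, S, num_experts)
        weights, idx = gate_logits.softmax(-1).max(-1)   # top-1 routing per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                              # tokens routed to expert e
            if mask.any():
                out[mask] = weights[mask].unsqueeze(-1) * expert(x[mask])
        return out
```

With top-1 routing, each token activates only a single expert, which is how a sparse MoE can keep competing objectives in separate parameter sets without increasing the compute per token.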
📝 Abstract
Humans possess a unified cognitive ability to perceive, comprehend, and interact with the physical world. Why can't large language models replicate this holistic understanding? Through a systematic analysis of existing training paradigms in vision-language-action (VLA) models, we identify two key challenges: spurious forgetting, where robot training overwrites crucial visual-text alignments, and task interference, where competing control and understanding tasks degrade performance when trained jointly. To overcome these limitations, we propose ChatVLA, a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference. ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art VLA methods on multimodal understanding benchmarks, achieving six times higher performance on MMMU and scoring 47.2% on MMStar with a more parameter-efficient design than ECoT. Furthermore, ChatVLA outperforms existing VLA methods such as OpenVLA on 25 real-world robot manipulation tasks. Our findings highlight the potential of our unified framework for achieving both robust multimodal understanding and effective robot control.
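To make the Phased Alignment Training recipe concrete, the sketch below shows one way the two stages could be scheduled: control-only training first, then a linearly ramped mix of multimodal batches. The `train_step`/`sample` interfaces, step counts, and mixing schedule are hypothetical assumptions; the abstract specifies only the staging principle, not these details.

```python
import random

def phased_alignment_training(model, control_data, multimodal_data,
                              stage1_steps=10_000, stage2_steps=10_000,
                              final_mix=0.5):
    """Hypothetical two-stage schedule (not the paper's exact recipe).

    Stage 1 trains on robot-control data alone to establish control mastery;
    Stage 2 gradually interleaves multimodal (e.g., VQA) batches so that
    visual-text alignment is restored without overwriting control skills.
    """
    # Stage 1: control mastery only; no multimodal data yet.
    for _ in range(stage1_steps):
        model.train_step(control_data.sample(), objective="action")

    # Stage 2: ramp the multimodal fraction from 0 up to final_mix,
    # integrating understanding data incrementally rather than all at once.
    for step in range(stage2_steps):
        p = final_mix * step / stage2_steps  # linear ramp of the mix ratio
        if random.random() < p:
            model.train_step(multimodal_data.sample(), objective="text")
        else:
            model.train_step(control_data.sample(), objective="action")
```

The ramp is one plausible reading of "incrementally integrates multimodal data": it avoids a sudden distribution shift at the stage boundary, which is exactly the kind of shift the paper blames for spurious forgetting.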