AI Summary
Current vision-language-action (VLA) models lack intermediate reasoning capabilities, hindering their ability to handle complex manipulation tasks that require temporal planning. To address this, we propose visual chain-of-thought (CoT) reasoning, a mechanism that autoregressively predicts future visual goal frames prior to action generation, establishing a closed-loop "perception → reasoning → action" pipeline. This work introduces explicit visual-level chain-of-thought reasoning into VLA frameworks for the first time, overcoming the limitations of end-to-end direct mapping. Built upon a 7B-parameter multimodal large language model, our approach, CoT-VLA, jointly models visual, linguistic, and action tokens in an architecture that combines visual goal prediction with short-horizon action sequence generation. Evaluated on real-world robotic manipulation tasks, our method achieves a 17% absolute success rate improvement; on simulation benchmarks, it yields a 6% gain, establishing new state-of-the-art performance among VLA models.
Abstract
Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations to learn generalizable sensorimotor control. While this paradigm effectively uses large-scale data from both robotic and non-robotic sources, current VLAs primarily learn direct input-output mappings and lack the intermediate reasoning steps crucial for complex manipulation tasks; as a result, they cannot perform temporal planning or reasoning. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into VLAs by autoregressively predicting future image frames as visual goals before generating a short action sequence to achieve each goal. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experiments show that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% on real-world manipulation tasks and 6% on simulation benchmarks. Project website: https://cot-vla.github.io/
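The abstract describes a two-stage decoding loop: first autoregressively generate a visual goal (a predicted future frame), then decode a short action chunk conditioned on that goal, and repeat. The sketch below illustrates that control flow only; `VLAModel`, `predict_goal_tokens`, and `predict_actions` are hypothetical placeholders, not the authors' actual API, and the token/action shapes are assumptions.

```python
# Illustrative sketch of a visual-CoT inference cycle (not the authors' code).
# Stage 1: decode image tokens for a visual subgoal ("reasoning").
# Stage 2: decode a short action sequence toward that subgoal ("action").
from dataclasses import dataclass
from typing import List


@dataclass
class VLAModel:
    horizon: int = 4  # assumed length of the short action chunk per subgoal

    def predict_goal_tokens(self, obs: List[int], instruction: str) -> List[int]:
        # Placeholder: a real model autoregressively decodes future-frame tokens.
        return [t + 1 for t in obs]

    def predict_actions(self, obs: List[int], goal: List[int],
                        instruction: str) -> List[List[float]]:
        # Placeholder: decode `horizon` low-level actions (7-DoF here) per goal.
        return [[0.0] * 7 for _ in range(self.horizon)]


def closed_loop_step(model: VLAModel, obs: List[int], instruction: str):
    """One perception -> reasoning -> action cycle."""
    goal = model.predict_goal_tokens(obs, instruction)       # visual chain-of-thought
    actions = model.predict_actions(obs, goal, instruction)  # short action sequence
    return goal, actions


model = VLAModel()
goal, actions = closed_loop_step(model, obs=[1, 2, 3], instruction="pick up the cup")
print(len(actions))  # -> 4
```

In a real system this cycle would run repeatedly: after executing the action chunk, a new observation is encoded and a fresh visual goal is predicted, which is what makes the loop closed.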