CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

📅 2025-03-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language-action (VLA) models lack intermediate reasoning capabilities, hindering their ability to handle complex manipulation tasks that require temporal planning. To address this, the authors propose CoT-VLA, which introduces a visual chain-of-thought (CoT) mechanism: the model autoregressively predicts future visual target frames before generating actions, establishing a closed-loop "perception → reasoning → action" pipeline. This work brings explicit visual-level chain-of-thought reasoning into VLA frameworks, moving beyond end-to-end direct input-output mapping. Built on a 7B-parameter multimodal large language model, the approach jointly models visual, linguistic, and action tokens in an architecture that combines visual target prediction with short-horizon action sequence generation. On real-world robotic manipulation tasks, the method improves over the state-of-the-art VLA model by 17%; on simulation benchmarks it yields a 6% gain, establishing new state-of-the-art performance among VLA models.

๐Ÿ“ Abstract
Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input–output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Project website: https://cot-vla.github.io/
Problem

Research questions and friction points this paper is trying to address.

Enhances VLAs with visual reasoning for complex tasks
Addresses lack of temporal planning in current VLAs
Improves performance in real-world and simulated manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates visual chain-of-thought reasoning
Predicts future image frames autoregressively
Generates action sequences for visual goals
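
The two-stage decoding described above can be sketched as follows. This is a hypothetical, simplified illustration (the `decode` function and the toy model are not from the paper): a single autoregressive model first emits "visual goal" tokens representing a predicted future frame, then emits a short action chunk conditioned on the prompt plus that predicted goal.

```python
# Minimal sketch of CoT-VLA-style two-stage autoregressive decoding.
# Hypothetical stub, not the authors' implementation: `next_token` stands in
# for a 7B multimodal model that scores visual, language, and action tokens.
from typing import Callable, List, Tuple

def decode(next_token: Callable[[List[int]], int],
           prompt: List[int],
           n_visual: int,
           n_action: int) -> Tuple[List[int], List[int]]:
    """Predict a visual subgoal, then a short-horizon action chunk."""
    seq = list(prompt)
    # Stage 1: visual chain-of-thought -- autoregressively predict
    # tokens of a future image frame (the visual goal).
    visual_goal = []
    for _ in range(n_visual):
        tok = next_token(seq)
        visual_goal.append(tok)
        seq.append(tok)
    # Stage 2: generate actions conditioned on prompt + predicted goal.
    actions = []
    for _ in range(n_action):
        tok = next_token(seq)
        actions.append(tok)
        seq.append(tok)
    return visual_goal, actions

# Toy deterministic "model": next token = (sum of context) mod 7.
toy_model = lambda ctx: sum(ctx) % 7

goal, acts = decode(toy_model, prompt=[3, 1, 4], n_visual=4, n_action=2)
print(goal, acts)  # -> [1, 2, 4, 1] [2, 4]
```

The key design point is that both stages share one token sequence, so the action tokens attend to the predicted goal frame exactly as they would to observed context.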
🔎 Similar Papers
No similar papers found.