VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing vision-language-action (VLA) models in long-horizon tasks: they treat visual inputs as static context and reason purely in text, so they struggle to actively revisit the scene to resolve ambiguities. To overcome this, the authors propose a thinking-with-image reasoning framework that, for the first time, models visual perception as a dynamically invocable reasoning action, enabling on-demand revisiting of environment images during task execution. Training combines supervised fine-tuning (SFT) on curated visual chain-of-thought data, which cold-starts structured reasoning and tool-use behaviors, with GRPO reinforcement learning that aligns complete reasoning-action trajectories with task-level success. Evaluated on the LIBERO and RoboTwin 2.0 benchmarks, the method achieves significant gains, reaching a 97.5% success rate on LIBERO and substantial improvements in long-horizon robotic manipulation.
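
To make "perception as a dynamically invocable reasoning action" concrete, here is a minimal sketch of such a control loop. This is an illustration under assumed interfaces, not the paper's implementation: `policy.generate`, `env.observe`, `env.step`, and the `think`/`look`/`act` step kinds are all hypothetical names.

```python
# Minimal sketch of a thinking-with-image control loop. Instead of reasoning
# over a single frozen observation, the policy may emit a "look" step that
# re-injects a fresh camera image into its context before acting.
# All interfaces below are hypothetical placeholders.

def run_episode(policy, env, instruction, max_steps=50):
    context = [instruction, env.observe()]       # interleaved text and images
    for _ in range(max_steps):
        kind, payload = policy.generate(context)
        if kind == "think":                      # textual chain-of-thought step
            context.append(payload)
        elif kind == "look":                     # perception invoked as an action:
            context.append(env.observe())        # revisit the scene on demand
        elif kind == "act":                      # low-level manipulation command
            observation, done = env.step(payload)
            context.append(observation)
            if done:
                return True                      # task-level success
    return False
```

The key difference from text-only chain-of-thought is the "look" branch: visual grounding becomes something the model can request mid-reasoning, rather than a fixed prefix of the prompt.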

📝 Abstract
Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning in which visual inputs are treated as static context. This limits the model's ability to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual chain-of-thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on the LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving a 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project and code: https://cywang735.github.io/VLA-Thinker/
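
For context on stage (2): GRPO dispenses with a learned value critic and instead scores each sampled trajectory against the other rollouts drawn for the same task prompt. Below is a minimal sketch of the group-relative advantage computation, assuming binary task-success rewards; the function name and tensor shapes are illustrative, not from the paper.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used in GRPO: normalize each rollout's
    reward by the mean and std of its own sampling group (no learned critic).

    rewards: (num_tasks, rollouts_per_task) tensor of scalar returns.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 task prompts, 4 rollouts each, binary success rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
advantages = grpo_advantages(rewards)
```

Each successful rollout receives a positive advantage relative to its group; that advantage then weights a PPO-style clipped objective over the tokens of the full reasoning-action trajectory.
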
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
embodied intelligence
chain-of-thought reasoning
long-horizon tasks
visual ambiguity
Innovation

Methods, ideas, or system contributions that make the work stand out.

thinking-with-image reasoning
vision-language-action models
visual chain-of-thought
dynamic perception as action
GRPO reinforcement learning