🤖 AI Summary
Current vision-language-action (VLA) models rely on next-frame prediction to build world models, but this entangles appearance and motion representations, leading to weak physical reasoning, distorted visual predictions, and inefficient policy learning. To address this, we propose Visual Chain of Thought (Visual CoT), a pretraining framework in which scene dynamics are explicitly modeled as optical flow within a single Transformer, enforcing motion reasoning *before* future-frame generation so that appearance and motion are decoupled. The resulting model, FlowVLA, adopts an autoregressive Transformer architecture, treats optical flow prediction as an interpretable intermediate representation, and jointly optimizes vision-language-action tasks. Evaluated on challenging robotic manipulation benchmarks, FlowVLA achieves state-of-the-art performance with significantly improved sample efficiency, demonstrating the dual advantages of physically plausible world modeling and efficient policy learning.
📝 Abstract
Many Vision-Language-Action (VLA) models rely on an internal world model trained via next-frame prediction. This approach, however, struggles with physical reasoning as it entangles static appearance with dynamic motion, often resulting in implausible visual forecasts and inefficient policy learning. To address these limitations, we introduce the Visual Chain of Thought (Visual CoT): a pre-training framework that encourages a model to reason about how a scene evolves before predicting what it will look like. We instantiate this principle in FlowVLA, which predicts a future frame ($v_{t+1}$) only after generating an intermediate optical flow representation ($f_t$) that encodes motion dynamics. This ``$v_t \rightarrow f_t \rightarrow v_{t+1}$'' reasoning process is implemented within a single autoregressive Transformer, guiding the model to learn disentangled dynamics. As a result, FlowVLA produces coherent visual predictions and facilitates more efficient policy learning. Experiments on challenging robotics manipulation benchmarks demonstrate state-of-the-art performance with substantially improved sample efficiency, pointing toward a more principled foundation for world modeling. Project page: https://irpn-lab.github.io/FlowVLA/
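Since the $v_t \rightarrow f_t \rightarrow v_{t+1}$ reasoning is realized as a single autoregressive sequence, a training example can be built by interleaving discrete frame tokens with the intermediate flow tokens. The sketch below illustrates one plausible token layout; the function name, sentinel ids, and tokenization scheme are illustrative assumptions, not the authors' actual implementation.

```python
def build_cot_sequence(frame_tokens, flow_tokens, bov=-2, bof=-1):
    """Interleave frames and flow into one autoregressive sequence:
    v_0, f_0, v_1, f_1, ..., v_T  (the "visual chain of thought").

    frame_tokens: per-frame token lists [v_0, ..., v_T]
    flow_tokens:  per-step flow token lists [f_0, ..., f_{T-1}]
    bov / bof:    hypothetical sentinel ids marking the start of a
                  frame segment / flow segment (assumed, not from paper)
    """
    assert len(flow_tokens) == len(frame_tokens) - 1
    seq = []
    for t, flow in enumerate(flow_tokens):
        seq += [bov] + frame_tokens[t]   # current frame v_t
        seq += [bof] + flow              # motion reasoning step f_t
    seq += [bov] + frame_tokens[-1]      # final target frame v_T
    return seq

# Example: two frames, one intermediate flow step.
seq = build_cot_sequence([[1, 2], [3, 4]], [[9, 9]])
# seq == [-2, 1, 2, -1, 9, 9, -2, 3, 4]
```

Because the flow tokens precede the next-frame tokens in the sequence, a standard causal Transformer must commit to a motion prediction before it generates appearance, which is the disentanglement the framework targets.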