DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chain-of-thought (CoT) approaches to autonomous driving struggle to combine fine-grained modeling of environment dynamics with efficient decision-making: Textual CoT lacks fine-grained spatiotemporal understanding, and Visual CoT is redundant because it predicts dense images. This work proposes DynVLA, a driving vision-language-action (VLA) model built on a new paradigm, Dynamics Chain-of-Thought (Dynamics CoT). DynVLA first predicts a compact representation of future world dynamics and only then generates driving actions, which makes decisions better informed and more physically grounded. A Dynamics Tokenizer compresses future evolution into a small set of dynamics tokens, with ego-centric and environment-centric dynamics decoupled for more accurate modeling. The model is trained with supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) to generate dynamics tokens before actions while keeping inference latency low. On NAVSIM, Bench2Drive, and a large-scale in-house dataset, DynVLA consistently outperforms Textual CoT and Visual CoT methods.

📝 Abstract
We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
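The ordering described in the abstract, dynamics tokens first and actions second, can be illustrated with a minimal sketch of how one target sequence might be laid out. Everything concrete here is an assumption made for illustration: the delimiter tokens, the `<ego_i>`/`<env_i>` naming, the integer codebook indices, and the helper names `build_target` and `split_output` are hypothetical and are not taken from the paper, which does not specify its token vocabulary in this summary.

```python
from typing import List, Tuple

# Hypothetical delimiter tokens (assumption; the real vocabulary is not given).
BOD, EOD = "<dyn>", "</dyn>"   # begin / end of the dynamics segment
BOA, EOA = "<act>", "</act>"   # begin / end of the action segment


def build_target(ego_dyn: List[int], env_dyn: List[int],
                 actions: List[str]) -> List[str]:
    """Lay out one target sequence with dynamics tokens before actions.

    ego_dyn and env_dyn are assumed to be codebook indices produced by some
    tokenizer; keeping them in separate groups mirrors the decoupling of
    ego-centric and environment-centric dynamics.
    """
    dyn = [f"<ego_{i}>" for i in ego_dyn] + [f"<env_{i}>" for i in env_dyn]
    return [BOD, *dyn, EOD, BOA, *actions, EOA]


def split_output(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Recover the dynamics and action segments from a generated sequence."""
    d0, d1 = tokens.index(BOD), tokens.index(EOD)
    a0, a1 = tokens.index(BOA), tokens.index(EOA)
    if not (d0 < d1 < a0 < a1):
        raise ValueError("dynamics segment must come before the action segment")
    return tokens[d0 + 1:d1], tokens[a0 + 1:a1]


seq = build_target([3, 17], [5, 42, 8], ["wp_0", "wp_1"])
dynamics, actions = split_output(seq)
```

In this toy layout the reasoning segment costs five tokens, which is the sense in which a compact dynamics representation is cheaper to decode than a dense image prediction.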
Problem

Research questions and friction points this paper is trying to address.

Autonomous Driving
World Dynamics
Action Reasoning
Chain-of-Thought
Spatiotemporal Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamics CoT
Dynamics Tokenizer
world dynamics modeling
VLA
autonomous driving
Shuyao Shang
NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA)
Bing Zhan
NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA)
Yunfei Yan
NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA)
Yuqi Wang
NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA)
Yingyan Li
Institute of Automation, Chinese Academy of Sciences
computer vision
Yasong An
Yinwang Intelligent Technology Co. Ltd.
Xiaoman Wang
Yinwang Intelligent Technology Co. Ltd.
Jierui Liu
Yinwang Intelligent Technology Co. Ltd.
Lu Hou
Yinwang Intelligent Technology Co. Ltd.
Lue Fan
NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA)
Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Biologically-inspired Learning
Tieniu Tan
Institute of Automation, Chinese Academy of Sciences