AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving

📅 2025-09-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing Vision-Language-Action (VLA) models overuse Chain-of-Thought (CoT) reasoning even in simple scenarios, incurring computational redundancy without improving decision quality. Method: We propose a dual-mode adaptive inference framework that integrates fast intuitive and slow deliberative reasoning, dynamically activating CoT based on scene complexity. We design a novel adaptive thinking reward function and employ Group Relative Policy Optimization (GRPO), a reinforcement learning method, to enable the model to autonomously learn "when to reason." Building on a vision-language-action joint pretraining architecture, we introduce dual-path supervised fine-tuning, with and without CoT. Contribution/Results: On the Navsim benchmark, our method achieves a PDMS of 90.3, outperforming the best pure-vision baseline by 1.7 points and surpassing the "never-reason" and "always-reason" baselines by 2.0 and 1.4 points, respectively, while reducing inference latency by 14%.

📝 Abstract
While reasoning techniques such as Chain of Thought (CoT) have been widely adopted in Vision-Language-Action (VLA) models and demonstrate promising capabilities in end-to-end autonomous driving, recent efforts to integrate CoT reasoning often fall short in simple scenarios, introducing unnecessary computational overhead without improving decision quality. To address this, we propose AdaThinkDrive, a novel VLA framework with a dual-mode reasoning mechanism inspired by fast and slow thinking. First, our framework is pretrained on large-scale autonomous driving (AD) scenarios using both question-answering (QA) and trajectory datasets to acquire world knowledge and driving commonsense. During supervised fine-tuning (SFT), we introduce a two-mode dataset, fast answering (without CoT) and slow thinking (with CoT), enabling the model to distinguish scenarios that require reasoning from those that do not. Furthermore, an Adaptive Think Reward strategy is proposed in conjunction with Group Relative Policy Optimization (GRPO), which rewards the model for selectively applying CoT by comparing trajectory quality across different reasoning modes. Extensive experiments on the Navsim benchmark show that AdaThinkDrive achieves a PDMS of 90.3, surpassing the best vision-only baseline by 1.7 points. Moreover, ablations show that AdaThinkDrive surpasses both the never-think and always-think baselines, improving PDMS by 2.0 and 1.4 points, respectively. It also reduces inference time by 14% compared to the always-think baseline, demonstrating its ability to balance accuracy and efficiency through adaptive reasoning.
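The Adaptive Think Reward and GRPO combination described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden reconstruction, not the paper's implementation: the function names, the fixed `think_bonus`, and the way trajectory quality is compared across modes are all hypothetical.

```python
# Hypothetical sketch of an adaptive-think reward plus a GRPO-style
# group-relative advantage. Names and the bonus scheme are assumptions,
# not AdaThinkDrive's actual reward function.

def adaptive_think_reward(traj_score, used_cot,
                          score_with_cot, score_without_cot,
                          think_bonus=0.1):
    """Base the reward on trajectory quality, then pay a bonus only when
    the chosen mode (CoT or no CoT) was actually the better choice."""
    reward = traj_score
    if used_cot:
        # Deliberation is rewarded only if it beats the fast answer.
        reward += think_bonus if score_with_cot > score_without_cot else -think_bonus
    else:
        # Skipping CoT is rewarded when thinking would not have helped.
        reward += think_bonus if score_with_cot <= score_without_cot else -think_bonus
    return reward

def grpo_advantages(rewards):
    """GRPO core idea: each sampled response's advantage is its reward
    normalized by the group's mean and standard deviation (no critic)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

Under this sketch, a group of sampled rollouts for the same scene (some with CoT, some without) is scored, and the group-relative advantages push the policy toward the mode that produced better trajectories in that scene.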
Problem

Research questions and friction points this paper is trying to address.

Adaptive reasoning for autonomous driving efficiency
Reducing unnecessary computation in simple scenarios
Balancing decision accuracy with inference speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-mode reasoning mechanism for driving
Adaptive Think Reward with GRPO strategy
Pretraining on QA and trajectory datasets
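The two-mode SFT data behind the innovations above could be organized as paired variants of the same planning task, one with a CoT target and one without. The sketch below is purely illustrative; the field names, tags, and trajectory format are assumptions, not the paper's actual schema.

```python
# Hypothetical dual-mode SFT samples: the same kind of driving prompt
# paired once with a direct answer (fast mode) and once with CoT
# reasoning (slow mode). All field names and tags are assumptions.

fast_sample = {
    "scene": "front_cam_0421.jpg",
    "prompt": "Plan the ego trajectory for the next 4 seconds.",
    "response": "<traj>[(1.2, 0.0), (2.4, 0.1), (3.6, 0.1)]</traj>",  # no CoT
}

slow_sample = {
    "scene": "front_cam_0876.jpg",
    "prompt": "Plan the ego trajectory for the next 4 seconds.",
    "response": (
        "<think>A pedestrian is entering the crosswalk ahead; "
        "decelerate and yield before continuing.</think>"
        "<traj>[(0.8, 0.0), (1.2, 0.0), (1.4, 0.0)]</traj>"
    ),
}

def reasoning_mode(sample):
    """Classify a training sample by whether its target includes CoT."""
    return "slow" if "<think>" in sample["response"] else "fast"
```

Training on both variants is what lets the model emit either form at inference time, so the RL stage can then reward choosing the right one per scene.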
👥 Authors
- Yuechen Luo (Tsinghua University)
- Fang Li (Xiaomi EV)
- Shaoqing Xu (University of Macau, BUAA, Xiaomi EV)
- Zhiyi Lai (Xiaomi EV)
- Lei Yang (Nanyang Technological University)
- Qimao Chen (Master's student, Tsinghua University)
- Ziang Luo (Tsinghua University)
- Zixun Xie (Xiaomi EV, Peking University)
- Shengyin Jiang (Xiaomi EV)
- Jiaxin Liu (Tsinghua University, Xiaomi EV)
- Long Chen (Xiaomi EV)
- Bing Wang (Xiaomi EV)
- Zhi-xin Yang (University of Macau)