ThinkDrive: Chain-of-Thought Guided Progressive Reinforcement Learning Fine-Tuning for Autonomous Driving

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key limitations of existing autonomous driving approaches: unstructured reasoning, poor generalization, and misalignment with human driving intent. Conventional supervised fine-tuning fails to fully exploit chain-of-thought (CoT) reasoning, while reinforcement learning suffers from instability and shallow inference. The authors propose a CoT-guided progressive reinforcement learning fine-tuning framework that first performs supervised fine-tuning on CoT rationales and then applies a difficulty-aware adaptive optimizer that dynamically modulates learning intensity, enabling synergistic optimization of explicit reasoning and policy learning. On public benchmarks, the method outperforms strong RL baselines by 1.45%, 1.95%, and 1.01% on the exam, easy-exam, and accuracy metrics, respectively; notably, a 2B-parameter variant surpasses GPT-4o by 3.28% on the exam metric.

📝 Abstract
With the rapid advancement of large language model (LLM) technologies, their application in the domain of autonomous driving has become increasingly widespread. However, existing methods suffer from unstructured reasoning, poor generalization, and misalignment with human driving intent. While Chain-of-Thought (CoT) reasoning enhances decision transparency, conventional supervised fine-tuning (SFT) fails to fully exploit its potential, and reinforcement learning (RL) approaches face instability and suboptimal reasoning depth. We propose ThinkDrive, a CoT-guided progressive RL fine-tuning framework for autonomous driving that synergizes explicit reasoning with difficulty-aware adaptive policy optimization. Our method employs a two-stage training strategy. First, we perform SFT using CoT explanations. Then, we apply progressive RL with a difficulty-aware adaptive policy optimizer that dynamically adjusts learning intensity based on sample complexity. We evaluate our approach on a public dataset. The results show that ThinkDrive outperforms strong RL baselines by 1.45%, 1.95%, and 1.01% on the exam, easy-exam, and accuracy metrics, respectively. Moreover, a 2B-parameter model trained with our method surpasses the much larger GPT-4o by 3.28% on the exam metric.
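The abstract does not specify how the difficulty-aware adaptive optimizer modulates learning intensity. As a hedged illustration of the general idea (scaling update strength by per-sample difficulty), a minimal sketch might look like the following; all names and the linear scaling rule are assumptions for illustration, not the paper's actual method:

```python
# Illustrative sketch of difficulty-aware update scaling.
# Assumption: a sample's difficulty is estimated from its empirical
# success rate (1.0 = easy, 0.0 = hard), and harder samples receive
# proportionally larger updates. The linear rule and the names
# `difficulty_weight` / `scaled_update` are hypothetical.

def difficulty_weight(success_rate: float,
                      low: float = 0.5,
                      high: float = 2.0) -> float:
    """Map success rate to a learning-intensity multiplier.

    success_rate = 1.0 (easy)  -> multiplier `low`
    success_rate = 0.0 (hard)  -> multiplier `high`
    """
    return high - (high - low) * success_rate

def scaled_update(grad: list[float],
                  success_rate: float,
                  base_lr: float = 1e-4) -> list[float]:
    """Scale a gradient step by the difficulty-dependent multiplier."""
    lr = base_lr * difficulty_weight(success_rate)
    return [lr * g for g in grad]
```

Under this sketch, a hard sample (low success rate) produces a larger step than an easy one with the same gradient, which is one plausible way to "dynamically adjust learning intensity based on sample complexity."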
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
Chain-of-Thought
reinforcement learning
reasoning transparency
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought
Progressive Reinforcement Learning
Difficulty-aware Optimization
Autonomous Driving
Reasoning Alignment