🤖 AI Summary
To address critical bottlenecks of Vision-Language Models (VLMs) in autonomous driving—including hallucination, inefficient reasoning, and lack of real-world validation—this paper proposes the first unified framework integrating Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation. Methodologically, it introduces: (1) the first domain-specific autonomous-driving tool library, coupled with automated generation of structured, self-verified tool-augmented reasoning data; (2) a two-stage training paradigm—Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO)—to enable VLMs to invoke tools autonomously and reliably; and (3) a multi-tool collaborative evaluation protocol. Evaluated on DriveLMM-o1, the framework achieves a 53.91% improvement in reasoning score and a 33.54% gain in answer accuracy, significantly enhancing reasoning quality, logical consistency, and zero-/few-shot generalization.
📝 Abstract
Vision-Language Models (VLMs) show promise for autonomous driving, yet they struggle with hallucinations, inefficient reasoning, and limited real-world validation, which hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: (i) Structured Data Generation, by establishing an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and (iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate that AgentThink significantly boosts overall reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.
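The second stage of the training pipeline, GRPO, scores a group of responses sampled for the same prompt and normalizes each response's reward against the group statistics, avoiding a learned value critic. The sketch below illustrates only that group-relative advantage computation; the function name and the example reward values are illustrative, not from the paper, and the paper's actual reward design (e.g. terms for answer correctness and tool usage) is an assumption here.

```python
# Minimal sketch of the group-relative advantage at the heart of GRPO:
# for a group of responses sampled from one prompt,
#   A_i = (r_i - mean(r)) / (std(r) + eps)
# Names and reward values below are illustrative, not the paper's.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 responses to the same driving question,
# e.g. combining answer-accuracy and tool-invocation terms.
rewards = [1.0, 0.5, 0.0, 0.5]
print([round(a, 3) for a in group_relative_advantages(rewards)])
# → [1.414, 0.0, -1.414, 0.0]
```

Responses scoring above the group mean receive positive advantage and are reinforced; below-mean responses are penalized, which is how the policy learns when invoking a tool actually improves the final answer.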