CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving

📅 2025-11-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language-action (VLA) models for end-to-end autonomous driving suffer from weak numerical reasoning and oversimplified input-output mappings, hindering causal reasoning in complex traffic scenarios. To address this, we propose CoT4AD, the first framework to introduce explicit chain-of-thought (CoT) reasoning into VLA modeling for autonomous driving. CoT4AD establishes a multi-stage reasoning chain, "perception → question → prediction → action," unifying semantic understanding, scene modeling, and trajectory planning. By jointly learning vision-language-action representations and combining explicit CoT supervision with implicit reasoning mechanisms, it aligns the reasoning space with the action space. Evaluated on the nuScenes and Bench2Drive benchmarks, CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop settings, significantly improving decision robustness and trajectory accuracy, especially under dynamic environmental conditions.

📝 Abstract
Vision-Language-Action (VLA) models have recently attracted growing attention in end-to-end autonomous driving for their strong reasoning capabilities and rich world knowledge. However, existing VLAs often suffer from limited numerical reasoning ability and overly simplified input-output mappings, which hinder their performance in complex driving scenarios requiring step-by-step causal reasoning. To address these challenges, we propose CoT4AD, a novel VLA framework that introduces Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs). CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. During training, it explicitly models a perception-question-prediction-action CoT to align the reasoning space with the action space across multiple driving tasks. During inference, it performs implicit CoT reasoning to enable consistent numerical reasoning and robust decision-making in dynamic environments. Extensive experiments on both real-world and simulated benchmarks, including nuScenes and Bench2Drive, demonstrate that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations. Code will be released upon paper acceptance.
Problem

Research questions and friction points this paper is trying to address.

Limited numerical and causal reasoning ability in autonomous driving VLMs
Weak step-by-step reasoning in complex driving scenarios
Misalignment between the reasoning space and the action space across driving tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Chain-of-Thought reasoning for autonomous driving
Explicitly models perception-question-prediction-action reasoning chain
Enables consistent numerical reasoning in dynamic environments
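The four-stage reasoning chain above can be sketched as a simple pipeline. Everything below (function names, the toy rules inside each stage, the input format) is an illustrative assumption for exposition, not the authors' implementation:

```python
# Hypothetical sketch of the "perception -> question -> prediction -> action"
# chain-of-thought stages described for CoT4AD. All names and rules here are
# illustrative placeholders, not the paper's actual model components.
from dataclasses import dataclass


@dataclass
class Scene:
    objects: list          # detected agents, e.g. ["pedestrian", "car"]
    ego_speed_mps: float   # ego vehicle speed in m/s


def perceive(raw_observation: dict) -> Scene:
    # Stage 1 (perception): turn raw input into a structured scene description.
    return Scene(objects=raw_observation["objects"],
                 ego_speed_mps=raw_observation["speed"])


def formulate_question(scene: Scene) -> str:
    # Stage 2 (question): pose the decision problem in natural language.
    return (f"Given {scene.objects} ahead at {scene.ego_speed_mps} m/s, "
            f"how should the ego vehicle proceed?")


def predict(scene: Scene) -> str:
    # Stage 3 (prediction): forecast other agents' behavior (toy rule here).
    return "pedestrian may cross" if "pedestrian" in scene.objects else "lane is clear"


def act(prediction: str) -> str:
    # Stage 4 (action): map the reasoning chain to a driving decision.
    return "decelerate" if "may cross" in prediction else "maintain speed"


def cot_pipeline(raw_observation: dict) -> dict:
    # Run the full chain and keep each intermediate step, so the reasoning
    # trace (not just the final action) is available for supervision.
    scene = perceive(raw_observation)
    prediction = predict(scene)
    return {
        "question": formulate_question(scene),
        "prediction": prediction,
        "action": act(prediction),
    }


result = cot_pipeline({"objects": ["pedestrian", "car"], "speed": 8.0})
```

Keeping the intermediate question and prediction explicit mirrors the paper's idea of supervising the reasoning trace during training, rather than mapping observations directly to actions.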
Zhaohui Wang
Peking University
Tengbo Yu
Tsinghua University
VLA · Computer Vision · Embodied AI
Hao Tang
Peking University