AD^2-Bench: A Hierarchical CoT Benchmark for MLLM in Autonomous Driving under Adverse Conditions

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks lack systematic evaluation of multimodal large language models’ (MLLMs) chain-of-thought (CoT) reasoning capabilities under adverse weather and complex traffic conditions. Method: We propose the first hierarchical, autonomous-driving-oriented CoT benchmark, introducing a fine-grained annotation paradigm that jointly leverages text-, point-, and region-level visual prompts; we atomically decompose reasoning steps and provide explicit ground-truth annotations. Our pipeline integrates human-curated labeling, hierarchical prompt engineering, and a multi-granularity evaluation framework—covering data collection, annotation, reasoning decomposition, and metric design. Contribution/Results: We release over 5,400 high-quality CoT samples. Experiments reveal that current state-of-the-art MLLMs achieve <60% accuracy, confirming the benchmark’s rigor and highlighting the gap in explainable, multi-step reasoning under challenging conditions. This work fills a critical void in robust, interpretable MLLM evaluation for autonomous driving and advances research toward trustworthy, weather-resilient systems.
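The step-level evaluation idea above can be sketched as follows. This is a minimal illustration, not the paper's actual scoring code: the function names, the exact-match comparison, and the all-steps-correct criterion are assumptions; the paper treats each intermediate reasoning step as an atomic unit with explicit ground truth, which is the property this sketch mirrors.

```python
# Hypothetical sketch of step-level CoT scoring: each reasoning step is
# compared against its ground-truth annotation as an atomic unit, and a
# sample counts as fully correct only if every step matches.

def normalize(step: str) -> str:
    """Lightweight normalization before exact matching (an assumption;
    a real benchmark would likely use a more robust matcher)."""
    return " ".join(step.lower().split())

def score_sample(predicted_steps, gold_steps):
    """Return (per-step accuracy, all-steps-correct flag) for one CoT sample."""
    matches = [
        normalize(p) == normalize(g)
        for p, g in zip(predicted_steps, gold_steps)
    ]
    # Score over the gold length so missing steps are penalized.
    step_acc = sum(matches) / max(len(gold_steps), 1)
    fully_correct = (step_acc == 1.0
                     and len(predicted_steps) == len(gold_steps))
    return step_acc, fully_correct

def benchmark_accuracy(samples):
    """Fraction of samples whose entire reasoning chain is correct."""
    flags = [score_sample(pred, gold)[1] for pred, gold in samples]
    return sum(flags) / max(len(flags), 1)
```

Under a chain-level criterion like this, a model can score well on individual steps yet still fall below 60% overall, which is consistent with the gap the benchmark is designed to expose.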

📝 Abstract
Chain-of-Thought (CoT) reasoning has emerged as a powerful approach to enhancing the structured, multi-step decision-making capabilities of Multi-Modal Large Language Models (MLLMs), and is particularly crucial for autonomous driving under adverse weather conditions and in complex traffic environments. However, existing benchmarks have largely overlooked the need for rigorous evaluation of CoT processes in these specific and challenging scenarios. To address this critical gap, we introduce AD^2-Bench, the first Chain-of-Thought benchmark specifically designed for autonomous driving in adverse weather and complex scenes. AD^2-Bench is meticulously constructed to fulfill three key criteria: comprehensive data coverage across diverse adverse environments, fine-grained annotations that support multi-step reasoning, and a dedicated evaluation framework tailored to assessing CoT performance. The core contribution of AD^2-Bench is its extensive collection of over 5.4k high-quality, manually annotated CoT instances. Each intermediate reasoning step in these annotations is treated as an atomic unit with explicit ground truth, enabling unprecedented fine-grained analysis of MLLMs' inferential processes under text-level, point-level, and region-level visual prompts. Our comprehensive evaluation of state-of-the-art MLLMs on AD^2-Bench reveals accuracy below 60%, highlighting the benchmark's difficulty and the need to advance robust, interpretable end-to-end autonomous driving systems. AD^2-Bench thus provides a standardized evaluation platform that drives research forward by improving MLLMs' reasoning in autonomous driving, making it an invaluable resource for the community.
Problem

Research questions and friction points this paper is trying to address.

Evaluating CoT reasoning in adverse driving conditions
Lack of benchmarks for multi-step decision-making in MLLMs
Assessing MLLMs' inferential processes under complex visual prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical CoT benchmark for adverse driving conditions
5.4k annotated CoT instances with fine-grained steps
Multi-level visual prompts for MLLM reasoning analysis