DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests

📅 2025-01-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large vision-language models (LVLMs) show weak visual understanding, over-reliance on textual cues, and limited cross-modal reasoning in complex real-world scenarios. To address this, the paper introduces DrivingVQA, the first visual chain-of-thought benchmark built from driving theory tests, comprising 3,931 expert-crafted multiple-choice questions with stepwise explanations aligned to the relevant visual entities. It reformulates driving exams as visual chain-of-thought tasks, proposes an entity-guided image-region cropping strategy to sharpen perception of critical visual elements, and fine-tunes models to reason over the image tokens of these cropped regions. Experiments show that mainstream open-source and proprietary LVLMs perform poorly on the benchmark in zero-shot settings, while the proposed approach improves visual chain-of-thought accuracy by up to 7%, providing a testbed for evaluating and improving trustworthy reasoning of LVLMs in high-stakes, safety-critical applications.

📝 Abstract
Large vision-language models (LVLMs) augment language models with visual understanding, enabling multimodal reasoning. However, due to the modality gap between textual and visual data, they often face significant challenges, such as over-reliance on text priors, hallucinations, and limited capacity for complex visual reasoning. Existing benchmarks to evaluate visual reasoning in LVLMs often rely on schematic or synthetic images and on imprecise machine-generated explanations. To bridge the modality gap, we present DrivingVQA, a new benchmark derived from driving theory tests to evaluate visual chain-of-thought reasoning in complex real-world scenarios. It offers 3,931 expert-crafted multiple-choice problems and interleaved explanations grounded with entities relevant to the reasoning process. We leverage this dataset to perform an extensive study of LVLMs' ability to reason about complex visual scenarios. Our experiments reveal that open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning under zero-shot settings. We investigate training strategies that leverage relevant entities to improve visual reasoning. Notably, we observe a performance boost of up to 7% when reasoning over image tokens of cropped regions tied to these entities.
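The entity-guided idea described above can be sketched in a few lines: given bounding boxes of reasoning-relevant entities, crop padded regions from the scene and interleave their image tokens with the question text. This is a minimal illustration, not the paper's actual implementation; the function names (`pad_box`, `build_prompt`) and the `<image>`/`<crop_image>` placeholder format are assumptions.

```python
# Hypothetical sketch of entity-guided region cropping for an LVLM prompt.
# Each relevant entity contributes a padded crop whose image tokens are
# interleaved with the text at the point where the entity is mentioned.

def pad_box(box, pad, img_w, img_h):
    """Expand an (x1, y1, x2, y2) box by `pad` pixels, clamped to the image."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - pad), max(0, y1 - pad),
            min(img_w, x2 + pad), min(img_h, y2 + pad))

def build_prompt(question, entities, img_w, img_h, pad=16):
    """Return an interleaved prompt string plus the crop boxes to extract.

    `entities` is a list of (name, box) pairs; the full image token comes
    first, then each entity name followed by its crop's token placeholder.
    """
    parts = ["<image>", question]
    crops = []
    for name, box in entities:
        crop = pad_box(box, pad, img_w, img_h)
        crops.append((name, crop))
        parts.append(f"{name}: <crop_image>")
    return " ".join(parts), crops

prompt, crops = build_prompt(
    "May I overtake here?",
    [("no-overtaking sign", (420, 110, 470, 170)),
     ("oncoming car", (250, 200, 330, 260))],
    img_w=640, img_h=480,
)
```

A real pipeline would pass each crop box to the vision encoder and substitute the resulting image tokens for the placeholders; the sketch only shows how crops stay tied to the entity mentions in the reasoning text.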
Problem

Research questions and friction points this paper is trying to address.

Visual-Linguistic Representation
Real-world Complexity
Multimodal Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

DrivingVQA
LVLMs Evaluation
Key Element Focusing