🤖 AI Summary
Current large vision-language models (LVLMs) show weak visual grounding, over-reliance on textual cues, and limited cross-modal reasoning in complex real-world scenarios. To address this, we introduce DrivingVQA—the first visual chain-of-thought benchmark built from driving theory exams—comprising 3,931 expert-crafted multiple-choice questions with stepwise explanations aligned to relevant visual entities. We reformulate driving exams as visual chain-of-thought tasks, propose an entity-guided image-region cropping strategy to sharpen perception of critical visual elements, and fine-tune models to reason over the image tokens of these cropped regions. Experiments show that mainstream open-source and proprietary LVLMs perform poorly on this benchmark in zero-shot settings. Our approach improves visual chain-of-thought accuracy by up to 7%, offering a new paradigm for evaluating and improving trustworthy reasoning in LVLMs for high-stakes, safety-critical real-world applications.
📝 Abstract
Large vision-language models (LVLMs) augment language models with visual understanding, enabling multimodal reasoning. However, due to the modality gap between textual and visual data, they often face significant challenges, such as over-reliance on text priors, hallucinations, and limited capacity for complex visual reasoning. Existing benchmarks for evaluating visual reasoning in LVLMs often rely on schematic or synthetic images and on imprecise machine-generated explanations. To bridge the modality gap, we present DrivingVQA, a new benchmark derived from driving theory tests to evaluate visual chain-of-thought reasoning in complex real-world scenarios. It offers 3,931 expert-crafted multiple-choice problems and interleaved explanations grounded in entities relevant to the reasoning process. We leverage this dataset to perform an extensive study of LVLMs' ability to reason about complex visual scenarios. Our experiments reveal that open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning under zero-shot settings. We investigate training strategies that leverage relevant entities to improve visual reasoning. Notably, we observe a performance boost of up to 7% when reasoning over image tokens of cropped regions tied to these entities.
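The abstract does not spell out how entity regions are cropped before their image tokens are passed to the model, so the following is only a minimal sketch of the general idea: given entity bounding boxes, pad each box slightly for context and clamp it to the image bounds. The function name `crop_entity_regions`, the `pad_ratio` parameter, and the box format `(x1, y1, x2, y2)` are all illustrative assumptions, not the paper's actual interface.

```python
def crop_entity_regions(image_size, entity_boxes, pad_ratio=0.1):
    """Return padded crop boxes for entity regions, clamped to the image.

    image_size:   (width, height) of the full image
    entity_boxes: list of (x1, y1, x2, y2) entity bounding boxes
    pad_ratio:    fraction of box width/height added as context padding
                  (hypothetical parameter for this sketch)
    """
    width, height = image_size
    crops = []
    for x1, y1, x2, y2 in entity_boxes:
        # Pad proportionally to the box size so small entities keep context.
        pad_w = (x2 - x1) * pad_ratio
        pad_h = (y2 - y1) * pad_ratio
        crops.append((
            max(0, int(x1 - pad_w)),
            max(0, int(y1 - pad_h)),
            min(width, int(x2 + pad_w)),
            min(height, int(y2 + pad_h)),
        ))
    return crops

# Example: a traffic-sign box in a 640x480 frame is padded by 10% per side.
boxes = crop_entity_regions((640, 480), [(100, 100, 200, 200)])
```

Each resulting crop could then be encoded separately and its image tokens interleaved with the full-image tokens, which is one plausible way to realize the "reasoning over image tokens of cropped regions" described above.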