🤖 AI Summary
Despite strong performance in visual question answering (VQA), it remains unclear whether large vision-language models (VLMs) genuinely ground their reasoning in visual evidence. Method: We introduce MagiC, the first comprehensive benchmark to systematically evaluate VLMs’ grounded reasoning across four dimensions: answer accuracy, reasoning validity, grounding fidelity, and self-correction ability. We propose two new metrics: MagiScore, which measures alignment between reasoning chains and visual evidence, and StepSense, which quantifies self-correction capability. We also design adversarial visual perturbation tests to assess robustness. The dataset comprises 5,500 weakly supervised and 900 human-annotated samples, supporting multi-dimensional evaluation protocols. Contribution/Results: Evaluating 15 VLMs (7B–70B parameters), we identify fundamental weaknesses: pervasive misalignment between reasoning steps and visual inputs, and limited self-correction ability. MagiC provides a reproducible benchmark and diagnostic framework focused on grounded reasoning in multimodal research.
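Neither metric's exact formula appears in the summary or abstract, so the following is only an illustrative sketch of what a MagiScore-style grounding check could look like: for each reasoning step, compare the region the model claims to rely on against human-annotated evidence boxes via IoU. The `ReasoningStep` structure, the 0.5 overlap threshold, and the step-level aggregation are assumptions made for illustration, not the authors' definition.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class ReasoningStep:
    text: str
    predicted_box: Box       # region the model claims to be looking at
    gold_boxes: List[Box]    # human-annotated evidence regions for this step


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_alignment_score(steps: List[ReasoningStep], thresh: float = 0.5) -> float:
    """Fraction of reasoning steps whose predicted region overlaps some gold
    evidence box above `thresh` (a stand-in for a MagiScore-style metric,
    not the paper's actual formula)."""
    if not steps:
        return 0.0
    hits = sum(
        1 for s in steps
        if any(iou(s.predicted_box, g) >= thresh for g in s.gold_boxes)
    )
    return hits / len(steps)
```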
📝 Abstract
Recent advances in large vision-language models have led to impressive performance in visual question answering and multimodal reasoning. However, it remains unclear whether these models genuinely perform grounded visual reasoning or rely on superficial patterns and dataset biases. In this work, we introduce MagiC, a comprehensive benchmark designed to evaluate grounded multimodal cognition, assessing not only answer accuracy but also the quality of step-by-step reasoning and its alignment with relevant visual evidence. Our benchmark includes approximately 5,500 weakly supervised QA examples generated from strong model outputs and 900 human-curated examples with fine-grained annotations, including answers, rationales, and bounding-box groundings. We evaluate 15 vision-language models ranging from 7B to 70B parameters across four dimensions: final answer correctness, reasoning validity, grounding fidelity, and self-correction ability. MagiC further includes diagnostic settings that probe robustness under adversarial visual cues and assess models’ capacity for introspective error correction. We introduce new metrics such as MagiScore and StepSense, and provide comprehensive analyses that reveal key limitations and opportunities in current approaches to grounded visual reasoning.
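The self-correction protocol behind StepSense is likewise not spelled out in the abstract. One common way to probe introspective error correction is to inject a deliberately wrong intermediate step and check whether the model flags it on review; the sketch below follows that idea under the assumption of a generic `generate(prompt)` wrapper around the VLM (image handling left inside the wrapper). It is a minimal illustration, not the paper's StepSense procedure.

```python
from typing import Callable, List


def self_correction_probe(
    generate: Callable[[str], str],   # model wrapper: prompt text -> response text
    question: str,
    steps: List[str],
    corrupt_index: int,
    corrupted_step: str,
) -> bool:
    """Replace one reasoning step with a deliberately wrong one, ask the model
    to review the chain against the image, and report whether it locates the
    injected error (a rough stand-in for a StepSense-style test)."""
    perturbed = list(steps)
    perturbed[corrupt_index] = corrupted_step
    chain = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(perturbed))
    prompt = (
        f"Question: {question}\n"
        f"Proposed reasoning:\n{chain}\n"
        "Review each step against the image. If any step is wrong, name the "
        "step number and correct it; otherwise reply 'all steps hold'."
    )
    reply = generate(prompt).lower()
    return f"step {corrupt_index + 1}" in reply  # did the model flag the injected error?
```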