VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks emphasize object recognition while neglecting pure visual reasoning, and they remain susceptible to linguistic priors and domain-knowledge biases. To address this, the authors propose VERIFY, the first benchmark explicitly designed to evaluate visual reasoning fidelity. VERIFY enforces image-only abstract reasoning by using minimal textual prompts and pairing each problem with a human-annotated, fine-grained reasoning path. The paper also introduces the first formal metric for visual reasoning fidelity, grounded in visual dependency and logical consistency, moving beyond conventional accuracy-centric evaluation. A comprehensive evaluation of mainstream MLLMs reveals a pervasive reliance on "linguistic shortcuts" that significantly degrades genuine visual reasoning performance. Crucially, VERIFY detects this phenomenon sensitively, establishing a new standard for interpretable, bias-mitigated multimodal reasoning assessment.

📝 Abstract
Visual reasoning is central to human cognition, enabling individuals to interpret and abstractly understand their environment. Although recent Multimodal Large Language Models (MLLMs) have demonstrated impressive performance across language and vision-language tasks, existing benchmarks primarily measure recognition-based skills and inadequately assess true visual reasoning capabilities. To bridge this critical gap, we introduce VERIFY, a benchmark explicitly designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. VERIFY compels models to reason primarily from visual information, providing minimal textual context to reduce reliance on domain-specific knowledge and linguistic biases. Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes. Additionally, we propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns. Our comprehensive benchmarking of leading MLLMs uncovers significant limitations, underscoring the need for a balanced and holistic approach to both perception and reasoning. For more examples and testing, visit our project page (https://verify-eqh.pages.dev/).
Problem

Research questions and friction points this paper is trying to address.

Existing MLLM benchmarks emphasize recognition over genuine visual reasoning
Linguistic priors and domain knowledge bias current evaluations
Accuracy-only metrics reveal little about the reasoning process itself
Innovation

Methods, ideas, or system contributions that make the work stand out.

VERIFY benchmark isolates visual reasoning via image-only abstract problems
Minimal textual prompts reduce linguistic and domain-knowledge biases
Human-annotated reasoning paths and novel fidelity metrics go beyond accuracy
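To make the "beyond accuracy" idea concrete, the sketch below shows one plausible shape such a metric could take: scoring a model's reasoning path against a human-annotated path in addition to checking the final answer. This is an illustrative assumption, not the paper's actual metric — the function name, the token-overlap matching, and the equal weighting are all hypothetical.

```python
def fidelity_score(model_steps, annotated_steps, model_answer, correct_answer):
    """Hypothetical fidelity-style score in [0, 1].

    Combines in-order agreement between a model's reasoning steps and a
    human-annotated reasoning path with final-answer correctness.
    Plain accuracy would only check the answer, so a model taking a
    linguistic shortcut to the right answer would still score 1.0.
    """
    def overlap(a, b):
        # Jaccard similarity over lowercase tokens (illustrative only).
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(ta | tb), 1)

    matched, start = 0, 0
    for ref in annotated_steps:
        # Greedily match each annotated step to a later model step.
        for j in range(start, len(model_steps)):
            if overlap(ref, model_steps[j]) >= 0.5:
                matched, start = matched + 1, j + 1
                break
    step_agreement = matched / max(len(annotated_steps), 1)
    answer_correct = 1.0 if model_answer == correct_answer else 0.0
    # Equal weighting is an arbitrary choice for this sketch.
    return 0.5 * step_agreement + 0.5 * answer_correct
```

Under this scoring, a model that guesses the correct answer with unrelated reasoning steps caps out at 0.5, which is the kind of accuracy/fidelity gap the benchmark is designed to expose.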