🤖 AI Summary
Existing vision-language benchmarks predominantly rely on local visual cues, failing to rigorously evaluate models’ global image understanding—thereby hindering robust dataset construction and practical multimodal model development.
Method: We propose the Region Comprehension Index (RCI), the first metric to quantify a task's dependence on global reasoning versus local cues by measuring the performance gap between full-image and localized region inputs. RCI serves as a model-driven, actionable diagnostic for identifying spatial biases and quantifying a benchmark's propensity for local-cue shortcuts.
Contribution/Results: A systematic analysis across 13 mainstream multimodal benchmarks reveals pervasive local bias in most datasets. RCI establishes a new paradigm for building more robust, real-world-oriented multimodal evaluation frameworks, providing both theoretical grounding and empirical validation for mitigating cue-based shortcuts and advancing holistic visual understanding.
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved impressive results on vision-language benchmarks, yet it remains unclear whether these benchmarks assess genuine global reasoning or allow success via localized visual cues. Existing evaluation methods do not explicitly measure this distinction, hindering effective dataset curation and real-world focused model development.
We introduce the Region Comprehension Index (RCI), the first model-based score to directly quantify a dataset's reliance on global versus local visual information. RCI systematically compares a reference model's performance on image patches versus full images, revealing whether tasks require holistic image understanding or can be solved from partial, localized visual cues.
Applying RCI to 13 widely used multimodal benchmarks, we find that most favor localized reasoning and exhibit significant spatial biases, indicating potential risks in real-world applications. RCI equips researchers and practitioners with an actionable tool for diagnosing and mitigating these biases, enabling the construction of datasets and benchmarks that foster robust, enterprise-ready multimodal systems.
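The patch-versus-full-image comparison at the core of RCI can be sketched in code. The abstract does not state the exact formula, so the normalization below (the gap between full-image accuracy and best-single-patch accuracy, scaled by full-image accuracy) and the function name are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def region_comprehension_index(full_image_scores, patch_scores):
    """Illustrative RCI sketch (assumed formula, not the paper's).

    full_image_scores: per-example correctness (0/1) when the reference
        model sees the full image; shape (n_examples,).
    patch_scores: per-example, per-patch correctness when the model sees
        only one localized region; shape (n_examples, n_patches).

    Returns a score near 0 when some local patch already suffices to
    solve most examples (local bias), and near 1 when full-image
    context is required (global reasoning).
    """
    full_image_scores = np.asarray(full_image_scores, dtype=float)
    patch_scores = np.asarray(patch_scores, dtype=float)

    acc_full = full_image_scores.mean()
    # Take the best patch per example: can the task be solved from
    # at least one local region alone?
    acc_best_patch = patch_scores.max(axis=1).mean()

    # Normalized performance gap between full-image and patch inputs.
    return (acc_full - acc_best_patch) / max(acc_full, 1e-8)


# Toy usage: 4 examples, 2 patches each. Only one example is solvable
# from a single patch, so the gap (and hence the index) is large.
full = [1, 1, 1, 1]
patches = [[0, 0], [0, 0], [0, 1], [0, 0]]
rci = region_comprehension_index(full, patches)  # → 0.75
```

Under this reading, a benchmark-level analysis would aggregate the index over a reference model's predictions per dataset, flagging benchmarks whose scores barely drop when the model sees only patches.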