GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
While current multimodal large language models (MLLMs) achieve strong performance on standard benchmarks, their visual grounding capabilities, particularly in handling ambiguous references, spatial relations, and unlocalizable queries, remain poorly characterized and likely far from human-level proficiency. Method: We introduce GroundingME, a fine-grained grounding benchmark spanning four dimensions: discriminative grounding, spatial-relation understanding, constrained-condition grounding, and rejection of unlocalizable queries. We further propose test-time reasoning-path re-ranking and rejection-aware data-mixture fine-tuning. Contribution/Results: GroundingME reveals that the best state-of-the-art MLLM achieves only 45.1% grounding accuracy, while most models show zero rejection capability. Our methods improve complex grounding performance by up to 2.9% and raise rejection accuracy to 27.9%. The benchmark is constructed via automated generation followed by rigorous human verification, ensuring reproducibility and high evaluation fidelity.
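To make the rejection-aware evaluation concrete, here is a minimal sketch of how such a metric could be scored, assuming the common convention that a predicted box is correct when its IoU with the gold box meets a threshold, and that ungroundable queries must be answered with an explicit rejection (represented here as `None`). The function names, box format, and 0.5 threshold are illustrative assumptions, not the paper's actual protocol.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def score_example(pred_box, gold_box, iou_thresh=0.5):
    """Score one example; gold_box is None for an ungroundable query."""
    if gold_box is None:           # rejection case: correct only if the
        return pred_box is None    # model explicitly declines to ground
    if pred_box is None:           # model rejected a groundable query
        return False
    return iou(pred_box, gold_box) >= iou_thresh
```

Under this scoring, a model that always outputs a box scores 0% on the rejection dimension no matter how good its localization is, which is exactly the failure mode the summary reports.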

📝 Abstract
Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity, where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative, distinguishing highly similar objects; (2) Spatial, understanding complex relational descriptions; (3) Limited, handling occlusions or tiny objects; and (4) Rejection, recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks, reflexively hallucinating objects rather than acknowledging their absence, raising critical safety concerns for deployment. We explore two strategies for improvement: (1) test-time scaling, which selects the optimal response based on its thinking trajectory and improves complex grounding by up to 2.9%, and (2) data-mixture training, which teaches models to recognize ungroundable queries, boosting rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations in MLLMs and a roadmap toward human-level visual grounding.
Problem

Research questions and friction points this paper is trying to address.

Assesses MLLMs' ability to ground language in vision beyond pattern-matching
Systematically challenges models across discriminative, spatial, limited, and rejection dimensions
Reveals a significant capability gap and safety concerns in real-world visual grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a benchmark challenging models across four critical dimensions
Uses automated generation with human verification for data curation
Proposes test-time scaling and data-mixture training for improvement
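The test-time scaling idea in the last bullet can be sketched as best-of-N re-ranking: sample several candidate answers with their reasoning traces, score each trace, and keep the answer attached to the best-scoring trace. The interface below (`candidates` as trace/box pairs, a pluggable `score_trajectory` function) is a hypothetical stand-in for the paper's actual method, shown only to illustrate the selection step.

```python
def rerank(candidates, score_trajectory):
    """Pick the answer whose reasoning trajectory scores highest.

    candidates: list of {"trace": str, "box": answer} dicts, e.g. N samples
    drawn from the model at nonzero temperature.
    score_trajectory: callable rating a reasoning trace (higher is better).
    """
    best = max(candidates, key=lambda c: score_trajectory(c["trace"]))
    return best["box"]

# Toy scorer (an assumption for demonstration): prefer longer traces.
# A real system would use a learned verifier or self-consistency signal.
def toy_scorer(trace):
    return len(trace)
```

The key design choice is that the final answer is never re-generated; re-ranking only chooses among already-sampled responses, so it can be bolted onto any model without retraining.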
Rang Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Lei Li
LLM-Core Xiaomi
Shuhuai Ren
Peking University
Deep Learning, Natural Language Processing
Hao Tian
LLM-Core Xiaomi
Shuhao Gu
Xiaomi
LLM, Vision-Language Model, AGI
Shicheng Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Zihao Yue
Renmin University of China
Multimodal AI, Language Modeling
Yudong Wang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Wenhan Ma
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Zhe Yang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Jingyuan Ma
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Zhifang Sui
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Fuli Luo
LLM-Core Xiaomi