Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

📅 2026-02-27
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing referring expression comprehension (REC) benchmarks: they are prone to shortcut strategies—such as reliance on short expressions or images with few distractors—and thus fail to adequately evaluate the true vision-language reasoning capabilities of multimodal large language models (MLLMs). To this end, the authors propose Ref-Adv, a novel REC benchmark that enforces deep semantic-visual alignment by pairing human-crafted referring expressions containing reasoning-intensive elements (e.g., negation and complex syntactic structures) with high-difficulty distractor images. Ref-Adv is the first benchmark to systematically suppress shortcut learning, introduce fine-grained reasoning-dimension annotations, and support ablation studies via word-order perturbation and descriptor deletion. Evaluations reveal that while leading MLLMs excel on standard REC tasks, their performance drops significantly on Ref-Adv, exposing their dependence on superficial cues and validating the benchmark's effectiveness in assessing genuine visual reasoning ability.

📝 Abstract
Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word-order perturbations and descriptor-deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
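The abstract mentions two ablations: word-order perturbation (does a model still ground the target when syntax is destroyed?) and descriptor deletion (is each descriptor actually necessary?). The paper's own tooling is not shown here; a minimal Python sketch of what such perturbations might look like, with hypothetical function names, is:

```python
import random


def shuffle_word_order(expression: str, seed: int = 0) -> str:
    """Randomly permute the words of a referring expression.

    If a model grounds the shuffled expression as well as the original,
    it is likely exploiting bag-of-words cues rather than syntax.
    """
    words = expression.split()
    rng = random.Random(seed)  # fixed seed for reproducible ablations
    rng.shuffle(words)
    return " ".join(words)


def delete_descriptor(expression: str, descriptor: str) -> str:
    """Drop one descriptor phrase to probe whether it is needed to
    uniquely identify the target (descriptor-deletion sufficiency).
    """
    return " ".join(expression.replace(descriptor, "").split())


# Hypothetical example expression with negation and a spatial relation
expr = "the mug that is not red, left of the laptop"
print(shuffle_word_order(expr))
print(delete_descriptor(expr, "left of the laptop"))
```

A model that scores similarly on the shuffled or truncated expressions and the originals is, by the paper's argument, relying on shortcuts rather than genuine compositional reasoning.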
Problem

Research questions and friction points this paper is trying to address.

Referring Expression Comprehension
Visual Reasoning
Multimodal LLMs
Grounding
Benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Referring Expression Comprehension
Visual Reasoning
Multimodal LLMs
Hard Distractors
Reasoning Facets