🤖 AI Summary
Existing referring expression comprehension (REC) benchmarks lack support for fine-grained reasoning, cross-modal rejection (i.e., correctly rejecting expressions whose target is absent from the image), and controllable assessment of difficulty. Method: We introduce RefFine, the first REC benchmark requiring multi-level reasoning over object categories, attributes, and multi-hop relationships, built with a tunable-difficulty fine-grained REC construction framework. The framework combines expert annotation with controlled image editing and textual reconstruction to generate high-quality positive and negative samples, including fine-grained negative image-text pairs explicitly designed to probe a model's rejection capability. Contribution/Results: Extensive experiments show substantial performance degradation of state-of-the-art REC models and multimodal large language models (MLLMs) on RefFine, confirming its difficulty and diagnostic value. RefFine establishes a new evaluation paradigm and data foundation for fine-grained vision-language grounding.
📝 Abstract
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates language understanding, image comprehension, and language-to-image grounding. It therefore serves as an ideal testing ground for Multi-modal Large Language Models (MLLMs). To this end, we establish a new REC dataset with two key features. First, it offers controllable difficulty levels, requiring multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Second, it includes negative text and images created through fine-grained editing and generation from existing data, testing a model's ability to correctly reject cases where the target object is not visible in the image, an essential aspect largely overlooked in existing datasets and approaches. Using this high-quality dataset, we conduct comprehensive evaluations of both state-of-the-art specialist REC models and MLLMs. Our findings show that a significant gap remains before satisfactory grounding performance is achieved. We anticipate that our dataset will inspire new approaches to enhance visual reasoning and more advanced cross-modal interaction strategies, ultimately unlocking the full potential of MLLMs.
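To make the evaluation setting concrete, here is a minimal sketch of how a benchmark with both positive pairs (the referred object is present) and negative pairs (the object is absent and the correct answer is rejection) could be scored. This is not the paper's official protocol; the sample format, field names, and the 0.5 IoU threshold are assumptions for illustration only.

```python
# Hypothetical scoring sketch for REC with cross-modal rejection (not the authors' code).
# Positive samples are judged by IoU against the ground-truth box;
# negative samples are judged by whether the model correctly abstains.
from typing import Optional, Tuple, Iterable, Dict

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def score(samples: Iterable[Tuple[Optional[Box], Optional[Box]]],
          iou_thresh: float = 0.5) -> Dict[str, float]:
    """samples: (ground_truth_box_or_None, predicted_box_or_None) pairs.
    A ground-truth box of None marks a negative pair (target absent from the image)."""
    pos_hits = pos_total = neg_hits = neg_total = 0
    for gt, pred in samples:
        if gt is None:                 # negative pair: correct behavior is rejection
            neg_total += 1
            neg_hits += pred is None
        else:                          # positive pair: grounding judged by IoU
            pos_total += 1
            pos_hits += pred is not None and iou(pred, gt) >= iou_thresh
    return {
        "grounding_acc@0.5": pos_hits / pos_total if pos_total else 0.0,
        "rejection_acc": neg_hits / neg_total if neg_total else 0.0,
    }

# Example: one correct grounding, one correct rejection, one false localization on a negative pair.
print(score([((0, 0, 10, 10), (1, 1, 10, 10)), (None, None), (None, (2, 2, 5, 5))]))
```

Under this kind of protocol, a model that always produces a box regardless of the expression scores zero rejection accuracy, which is exactly the failure mode the negative image-text pairs are designed to expose.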