FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension

📅 2024-09-23
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing referring expression comprehension (REC) benchmarks lack support for fine-grained reasoning, cross-modal rejection (i.e., correctly rejecting unmentioned content), and controllable difficulty assessment. Method: We introduce FineCops-Ref—the first REC benchmark enabling multi-level reasoning over object categories, attributes, and multi-hop relationships—built via a tunable-difficulty fine-grained REC framework. It integrates expert annotation with controlled image editing and textual reconstruction to generate high-quality positive and negative samples, including fine-grained negative image-text pairs explicitly designed to evaluate model rejection capability. Contribution/Results: Extensive experiments reveal substantial performance degradation of state-of-the-art REC models and multimodal large language models (MLLMs) on FineCops-Ref, confirming its strong challenge and diagnostic utility. FineCops-Ref establishes a new evaluation paradigm and data foundation for fine-grained vision-language grounding.

📝 Abstract
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding. Consequently, it serves as an ideal testing ground for Multi-modal Large Language Models (MLLMs). In pursuit of this goal, we have established a new REC dataset characterized by two key features: Firstly, it is designed with controllable varying levels of difficulty, necessitating multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Secondly, it includes negative text and images created through fine-grained editing and generation based on existing data, thereby testing the model’s ability to correctly reject scenarios where the target object is not visible in the image—an essential aspect often overlooked in existing datasets and approaches. Utilizing this high-quality dataset, we conducted comprehensive evaluations of both state-of-the-art specialist models and MLLMs. Our findings indicate that there remains a significant gap in achieving satisfactory grounding performance. We anticipate that our dataset will inspire new approaches to enhance visual reasoning and develop more advanced cross-modal interaction strategies, ultimately unlocking the full potential of MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Complex Anaphora Resolution
Cross-modal Information Integration
Visual Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

FineCops-Ref
Visual Reasoning
Cross-modal Interaction