ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension

📅 2025-07-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing referring expression comprehension (REC) methods primarily focus on single-entity localization. They struggle to model complex inter-entity relationships, and the field lacks datasets with fine-grained image–text–relation annotations. This work proposes ReMeREC, a relation-aware, multi-entity REC framework. First, it constructs ReMeX, a multi-entity REC dataset with detailed relationship and textual annotations. Second, it introduces the Text-adaptive Multi-entity Perceptron (TMP), which dynamically infers the number of entities and their spans from fine-grained textual cues. Third, it adds the Entity Inter-relationship Reasoner (EIR) to enhance relational reasoning and global scene understanding. A small auxiliary dataset, EntityText, generated with large language models, further improves comprehension of fine-grained prompts. On four benchmark datasets, ReMeREC achieves state-of-the-art performance in multi-entity grounding and relation prediction.

📝 Abstract

Referring Expression Comprehension (REC) aims to localize specified entities or regions in an image based on natural language descriptions. While existing methods handle single-entity localization, they often ignore complex inter-entity relationships in multi-entity scenes, limiting their accuracy and reliability. Additionally, the lack of high-quality datasets with fine-grained, paired image-text-relation annotations hinders further progress. To address these challenges, we first construct a relation-aware, multi-entity REC dataset called ReMeX, which includes detailed relationship and textual annotations. We then propose ReMeREC, a novel framework that jointly leverages visual and textual cues to localize multiple entities while modeling their inter-relations. To address the semantic ambiguity caused by implicit entity boundaries in language, we introduce the Text-adaptive Multi-entity Perceptron (TMP), which dynamically infers both the quantity and span of entities from fine-grained textual cues, producing distinctive representations. In addition, our Entity Inter-relationship Reasoner (EIR) enhances relational reasoning and global scene understanding. To further improve language comprehension for fine-grained prompts, we also construct a small-scale auxiliary dataset, EntityText, generated using large language models. Experiments on four benchmark datasets show that ReMeREC achieves state-of-the-art performance in multi-entity grounding and relation prediction, outperforming existing approaches by a large margin.
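
To make the task setting concrete, the following is a minimal sketch of what a relation-aware, multi-entity REC sample and a localization check might look like. Everything here is an assumption for illustration: the class names (Entity, Relation, Sample), the field layout, and the helper all_entities_localized are hypothetical and are not taken from ReMeX or ReMeREC. The IoU threshold of 0.5 is a common convention in REC evaluation, not a metric stated in the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class Entity:
    phrase: str  # text span in the expression that names the entity
    box: Box     # ground-truth location in the image


@dataclass
class Relation:
    subject: int    # index into Sample.entities
    predicate: str  # e.g. "holding"
    obj: int        # index into Sample.entities


@dataclass
class Sample:
    expression: str
    entities: List[Entity]
    relations: List[Relation]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def all_entities_localized(pred_boxes: List[Box], sample: Sample,
                           threshold: float = 0.5) -> bool:
    """True if one box is predicted per entity, in order, and each
    prediction overlaps its ground truth with IoU >= threshold."""
    if len(pred_boxes) != len(sample.entities):
        return False
    return all(iou(p, e.box) >= threshold
               for p, e in zip(pred_boxes, sample.entities))


# Hypothetical sample: two entities linked by one relation.
sample = Sample(
    expression="the man holding a cup",
    entities=[Entity("the man", (0, 0, 10, 10)),
              Entity("a cup", (4, 4, 6, 6))],
    relations=[Relation(subject=0, predicate="holding", obj=1)],
)
```

In this sketch a sample counts as correct only when every referred entity is localized, which reflects the multi-entity setting; relation prediction would be scored separately against the relations list.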

Problem

Research questions and friction points this paper is trying to address.

Localizing multiple entities with complex inter-entity relationships

Addressing the lack of datasets with fine-grained image-text-relation annotations

Resolving semantic ambiguity from implicit entity boundaries in language

Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs ReMeX, a relation-aware, multi-entity REC dataset

Proposes Text-adaptive Multi-entity Perceptron (TMP)

Introduces Entity Inter-relationship Reasoner (EIR)

🔎 Similar Papers

ReLiK: Retrieve and LinK, Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget