ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing referring expression comprehension (REC) methods focus primarily on single-entity localization; they struggle to model complex inter-entity relationships, and the field lacks datasets with fine-grained image-text-relation annotations. This work proposes ReMeREC, a relation-aware, multi-entity REC framework. First, it introduces ReMeX, a multi-entity REC dataset with detailed relationship and textual annotations. Second, it designs the Text-adaptive Multi-entity Perceptron (TMP), which dynamically infers the number of entities and their spans from fine-grained textual cues. Third, it adds an Entity Inter-relationship Reasoner (EIR) to strengthen relational reasoning and global scene understanding. A small auxiliary dataset, EntityText, generated with large language models, further improves comprehension of fine-grained prompts. On four benchmark datasets, ReMeREC achieves state-of-the-art performance in multi-entity grounding and relation prediction.

📝 Abstract
Referring Expression Comprehension (REC) aims to localize specified entities or regions in an image based on natural language descriptions. While existing methods handle single-entity localization, they often ignore complex inter-entity relationships in multi-entity scenes, limiting their accuracy and reliability. Additionally, the lack of high-quality datasets with fine-grained, paired image-text-relation annotations hinders further progress. To address this challenge, we first construct a relation-aware, multi-entity REC dataset called ReMeX, which includes detailed relationship and textual annotations. We then propose ReMeREC, a novel framework that jointly leverages visual and textual cues to localize multiple entities while modeling their inter-relations. To address the semantic ambiguity caused by implicit entity boundaries in language, we introduce the Text-adaptive Multi-entity Perceptron (TMP), which dynamically infers both the quantity and span of entities from fine-grained textual cues, producing distinctive representations. Additionally, our Entity Inter-relationship Reasoner (EIR) enhances relational reasoning and global scene understanding. To further improve language comprehension for fine-grained prompts, we also construct a small-scale auxiliary dataset, EntityText, generated using large language models. Experiments on four benchmark datasets show that ReMeREC achieves state-of-the-art performance in multi-entity grounding and relation prediction, outperforming existing approaches by a large margin.
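To make the task setting concrete, here is a minimal sketch of how one relation-aware, multi-entity sample could be represented. Every name here (`Entity`, `Relation`, `Sample`, `relation_triples`), the box convention, and the example values are assumptions for illustration only; the abstract does not describe the actual ReMeX annotation format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Entity:
    phrase: str  # text span in the expression that mentions the entity
    box: Box     # bounding box that grounds the phrase in the image


@dataclass
class Relation:
    subject: int    # index into Sample.entities
    predicate: str  # relation label, e.g. "holding"
    object: int     # index into Sample.entities


@dataclass
class Sample:
    expression: str
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)


def relation_triples(sample: Sample) -> List[Tuple[str, str, str]]:
    """Resolve index-based relations into (subject, predicate, object) phrases."""
    triples = []
    for rel in sample.relations:
        for idx in (rel.subject, rel.object):
            if not 0 <= idx < len(sample.entities):
                raise IndexError(f"relation refers to unknown entity index {idx}")
        triples.append(
            (
                sample.entities[rel.subject].phrase,
                rel.predicate,
                sample.entities[rel.object].phrase,
            )
        )
    return triples


# Hypothetical example: one expression, two entities, one relation.
sample = Sample(
    expression="the man holding a red umbrella",
    entities=[
        Entity("the man", (40.0, 30.0, 180.0, 400.0)),
        Entity("a red umbrella", (120.0, 10.0, 300.0, 150.0)),
    ],
    relations=[Relation(subject=0, predicate="holding", object=1)],
)

print(relation_triples(sample))
```

Storing relations as indices into the entity list keeps each entity's phrase and box in one place, so a model that predicts a variable number of entities can emit relations without duplicating box data.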
Problem

Research questions and friction points this paper is trying to address.

Localizing multiple entities with complex inter-entity relationships
Addressing lack of datasets with fine-grained image-text-relation annotations
Resolving semantic ambiguity from implicit entity boundaries in language
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs relation-aware multi-entity dataset ReMeX
Proposes Text-adaptive Multi-entity Perceptron (TMP)
Introduces Entity Inter-relationship Reasoner (EIR)
Authors

Yizhi Hu
Beijing University of Posts and Telecommunications

Zezhao Tian
Beijing University of Posts and Telecommunications

Xingqun Qi
The Hong Kong University of Science and Technology (HKUST)
Computer Vision, Human Motion Modeling, Medical Image Analysis

Chen Su
PhD candidate in College of Optical Science and Engineering, Zhejiang University
3D display, HCI

Bingkun Yang
Beijing University of Posts and Telecommunications

Junhui Yin
Beijing University of Posts and Telecommunications

Muyi Sun
School of AI, BUPT (<< NLPR CASIA << BUPT)
Multi-Modality Learning, Computer Vision, Biometrics, Medical Image Analysis

Man Zhang
Beijing University of Posts and Telecommunications

Zhenan Sun
Institute of Automation, Chinese Academy of Sciences
Biometrics, Pattern Recognition, Computer Vision