Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension

📅 2025-01-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses Generalized Referring Expression Comprehension (GREC), which requires localizing the objects a referring expression describes in zero-, single-, and multi-target scenarios. Conventional Referring Expression Comprehension (REC) methods struggle here because they produce a fixed number of outputs and have limited cross-modal representations. We propose the Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G). Its Hierarchical Multi-modal Semantic Alignment (HMSA) module aligns language and vision at three levels: word-object, phrase-object, and text-image. An Adaptive Grounding Counter (AGC) dynamically predicts the number of target objects, and an auxiliary contrastive loss improves its counting ability by pulling together multi-modal features with the same count and pushing apart those with different counts. The method is trained via multi-task joint optimization and achieves state-of-the-art performance on GREC and on four other tasks: REC, Phrase Grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES), which indicates that it generalizes across grounding and segmentation settings.
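
To make the role of a predicted target count concrete, here is a minimal sketch of how a count could turn a fixed set of candidate boxes into a variable-length output. The summary does not say how HieA2G combines the counter with its box predictions, so the top-k-by-score rule, the function name `select_targets`, and the box format are all assumptions for illustration only.

```python
def select_targets(boxes, scores, predicted_count):
    """Keep the `predicted_count` highest-scoring candidate boxes.

    Hypothetical helper: a count of 0 models a no-target expression,
    1 a single-target expression, and >1 a multi-target expression.
    """
    if predicted_count <= 0:
        return []  # no-target case: return nothing
    # Rank candidates by score, highest first (assumed selection rule).
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    # Slicing never over-runs: asking for more boxes than exist returns all.
    return [box for _, box in ranked[:predicted_count]]


# Illustrative candidates as (x1, y1, x2, y2) with made-up scores.
boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)]
scores = [0.2, 0.9, 0.6]
print(select_targets(boxes, scores, 2))
```

The point of the sketch is only that the output length is driven by the count rather than fixed in advance, which is the limitation of conventional REC methods described above.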

📝 Abstract
In this work, we address the challenging task of Generalized Referring Expression Comprehension (GREC). Compared to the classic Referring Expression Comprehension (REC) that focuses on single-target expressions, GREC extends the scope to a more practical setting by further encompassing no-target and multi-target expressions. Existing REC methods face challenges in handling the complex cases encountered in GREC, primarily due to their fixed output and limitations in multi-modal representations. To address these issues, we propose a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G) for GREC, which can flexibly deal with various types of referring expressions. First, a Hierarchical Multi-modal Semantic Alignment (HMSA) module is proposed to incorporate three levels of alignment: word-object, phrase-object, and text-image. It enables hierarchical cross-modal interactions across multiple levels to achieve comprehensive and robust multi-modal understanding, greatly enhancing grounding ability for complex cases. Then, to address the varying number of target objects in GREC, we introduce an Adaptive Grounding Counter (AGC) to dynamically determine the number of output targets. Additionally, an auxiliary contrastive loss is employed in AGC to enhance object-counting ability by pulling together multi-modal features with the same count and pushing apart those with different counts. Extensive experimental results show that HieA2G achieves new state-of-the-art performance on the challenging GREC task and on four other tasks, namely REC, Phrase Grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES), demonstrating the superiority and generalizability of the proposed HieA2G.
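
The pull-together, push-apart behaviour described for the auxiliary contrastive loss can be illustrated with a small sketch. The abstract does not give the loss formula, so the version below assumes a supervised-contrastive form in which the target count acts as the class label; the function names, the cosine similarity, and the temperature value are assumptions, not the paper's definition.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def count_contrastive_loss(features, counts, temperature=0.1):
    """Assumed supervised-contrastive loss with target counts as labels.

    Samples sharing a count are positives for each other; all other
    samples in the batch are negatives. Anchors without any positive
    are skipped.
    """
    n = len(features)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and counts[j] == counts[i]]
        if not positives:
            continue
        # Temperature-scaled similarity of the anchor to every other sample.
        logits = {j: cosine(features[i], features[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability of picking a positive.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy multi-modal features: two clusters in 2-D.
features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
loss_matched = count_contrastive_loss(features, [1, 1, 2, 2])
loss_mismatched = count_contrastive_loss(features, [1, 2, 1, 2])
print(loss_matched < loss_mismatched)
```

In this toy batch the loss is lower when features with the same count already sit close together, which is the geometry such a loss would encourage during training.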
Problem

Research questions and friction points this paper is trying to address.

Generalized Referring Expression Comprehension
Multiple Targets
Limitations of Traditional REC Methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

HieA2G
HMSA
AGC
Authors

Yaxian Wang
Chang'an University, Xi'an Jiaotong University
Henghui Ding
Fudan University
Computer Vision, Machine Learning, Segmentation, AIGC
Shuting He
Assistant Professor, Shanghai University of Finance and Economics
Computer Vision
Xudong Jiang
Nanyang Technological University, Singapore
Bifan Wei
School of Continuing Education, Xi’an Jiaotong University, China; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, China
Jun Liu
School of Computer Science and Technology, Xi’an Jiaotong University, China; Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xi’an Jiaotong University, China