Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension

📅 2025-01-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses Generalized Referring Expression Comprehension (GREC), a task that requires localizing referring expressions in a unified way across no-target, single-target, and multi-target scenarios. Conventional REC methods handle these cases poorly because they produce a fixed number of outputs and rely on limited cross-modal representations. We propose a Hierarchical Multi-modal Semantic Alignment (HMSA) module that performs alignment at three levels: word-object, phrase-object, and text-image. Additionally, we design an Adaptive Grounding Counter (AGC) that dynamically predicts the number of target objects, together with an auxiliary contrastive loss that improves counting by pulling together multi-modal features with the same count and pushing apart those with different counts. Our method is trained via multi-task joint optimization and achieves state-of-the-art performance on GREC. It also achieves state-of-the-art results on four other tasks: REC, Phrase Grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES).
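To make the role of the Adaptive Grounding Counter concrete, here is a minimal sketch of count-driven output selection. It is an illustration under stated assumptions, not the paper's implementation: the function name `select_targets`, the use of an argmax over count logits, and the top-k selection by candidate score are all hypothetical choices made for this example.

```python
def select_targets(scores, count_logits):
    """Return the indices of the candidates kept as grounding outputs.

    scores:       one confidence value per candidate object (assumed input).
    count_logits: one logit per possible target count 0..K (assumed input).
    """
    # Assumption: the predicted count is the argmax over the count logits.
    k = max(range(len(count_logits)), key=lambda i: count_logits[i])
    # Rank candidates by confidence and keep the k best; k == 0 keeps none,
    # which covers the no-target case.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Multi-target expression: the counter favours a count of 2.
print(select_targets([0.1, 0.9, 0.8, 0.3], [0.0, 0.2, 2.0, 0.1]))
# No-target expression: the counter favours a count of 0.
print(select_targets([0.1, 0.9, 0.8, 0.3], [3.0, 0.1, 0.1, 0.1]))
```

The point of the sketch is that the output size is driven by a predicted count rather than fixed in advance, so the same interface can return zero, one, or several targets.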

📝 Abstract

In this work, we address the challenging task of Generalized Referring Expression Comprehension (GREC). Compared to the classic Referring Expression Comprehension (REC) that focuses on single-target expressions, GREC extends the scope to a more practical setting by further encompassing no-target and multi-target expressions. Existing REC methods face challenges in handling the complex cases encountered in GREC, primarily due to their fixed output and limitations in multi-modal representations. To address these issues, we propose a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G) for GREC, which can flexibly deal with various types of referring expressions. First, a Hierarchical Multi-modal Semantic Alignment (HMSA) module is proposed to incorporate three levels of alignment: word-object, phrase-object, and text-image. It enables hierarchical cross-modal interactions across multiple levels to achieve comprehensive and robust multi-modal understanding, greatly enhancing grounding ability for complex cases. Then, to address the varying number of target objects in GREC, we introduce an Adaptive Grounding Counter (AGC) to dynamically determine the number of output targets. Additionally, an auxiliary contrastive loss is employed in AGC to enhance object-counting ability by pulling together multi-modal features with the same count and pushing apart those with different counts. Extensive experimental results show that HieA2G achieves new state-of-the-art performance on the challenging GREC task and on four other tasks, namely REC, Phrase Grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES), demonstrating the superiority and generalizability of the proposed HieA2G.
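The abstract describes the auxiliary contrastive loss only in words, so the following is a hedged sketch of one way such a loss could look. It assumes a supervised-contrastive-style formulation with cosine similarity and a temperature; the function name `count_contrastive_loss`, the `temperature` parameter, and the exact formula are assumptions for illustration and are not taken from the paper.

```python
import math


def _cosine(u, v):
    # Cosine similarity between two non-zero feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def count_contrastive_loss(features, counts, temperature=0.1):
    """Assumed formulation: samples sharing a target count are positives.

    features: list of non-zero feature vectors, one per sample.
    counts:   ground-truth number of targets for each sample.
    """
    n = len(features)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and counts[j] == counts[i]]
        if not positives:
            continue  # an anchor with no same-count partner contributes nothing
        # Temperature-scaled similarities to every other sample in the batch.
        logits = {j: _cosine(features[i], features[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability of picking a same-count sample.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Features that cluster by count give a lower loss than features that do not.
print(count_contrastive_loss(feats, [1, 1, 2, 2]) <
      count_contrastive_loss(feats, [1, 2, 1, 2]))
```

Minimizing a loss of this shape pulls same-count features together and pushes different-count features apart, which is the behaviour the abstract attributes to the AGC auxiliary loss.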

Problem

Research questions and friction points this paper is trying to address.

Generalized Referring Expression Comprehension

Multiple Targets

Limitations of Traditional REC Methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

HieA2G

HMSA

AGC
