🤖 AI Summary
In generalized visual grounding, existing approaches model generalized referring expression comprehension (GREC) and generalized referring expression segmentation (GRES) separately, lack instance awareness, and produce inconsistent multi-granularity predictions, particularly in multi-target and non-target scenarios.
Method: We propose InstanceVG, to our knowledge the first instance-aware joint learning framework for GREC and GRES, unifying point-, box-, and mask-level predictions within a Transformer architecture via instance queries and prior reference points. A reference-point-driven matching strategy enforces instance-level box-mask consistency and enables cross-granularity optimization.
Contribution/Results: InstanceVG achieves state-of-the-art performance on ten datasets across four tasks, significantly surpassing existing methods on various evaluation metrics. Its design facilitates consistent point, box, and mask predictions for the same instance, including in multi-target and non-target scenarios.
📝 Abstract
Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi-target and non-target scenarios. Specifically, GREC focuses on accurately identifying all referred objects at the coarse bounding-box level, while GRES aims to achieve fine-grained pixel-level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi-granularity predictions and streamline the overall process. Moreover, current methods often treat GRES as a semantic segmentation task, neglecting the crucial role of instance-aware capabilities and the need for consistent predictions between instance-level boxes and masks. To address these limitations, we propose InstanceVG, a multi-task generalized visual grounding framework equipped with instance-aware capabilities, which leverages instance queries to unify the joint and consistent prediction of instance-level boxes and masks. To the best of our knowledge, InstanceVG is the first framework to simultaneously tackle both GREC and GRES while incorporating instance-aware capabilities into generalized visual grounding. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching. This design facilitates consistent predictions of points, boxes, and masks for the same instance. Extensive experiments on ten datasets across four tasks demonstrate that InstanceVG achieves state-of-the-art performance, significantly surpassing existing methods on various evaluation metrics. The code and model will be publicly available at https://github.com/Dmmm1997/InstanceVG.
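
The idea of using each query's prior reference point as an additional basis for target matching can be illustrated with a minimal sketch. This is not the InstanceVG implementation: the function names, the cost terms (box L1 distance plus the distance from the reference point to the ground-truth box center), the `point_weight` parameter, and the brute-force search standing in for a Hungarian-style solver are all assumptions made for illustration. Boxes are `(x1, y1, x2, y2)` in normalized coordinates.

```python
from itertools import permutations
from math import hypot, inf


def box_center(box):
    """Center (cx, cy) of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def match_cost(pred_box, ref_point, gt_box, point_weight=1.0):
    """Hypothetical cost: box L1 distance plus a reference-point term.

    The point term (an assumption) is the distance between the query's
    prior reference point and the center of the ground-truth box.
    """
    box_l1 = sum(abs(p - g) for p, g in zip(pred_box, gt_box))
    cx, cy = box_center(gt_box)
    point_dist = hypot(ref_point[0] - cx, ref_point[1] - cy)
    return box_l1 + point_weight * point_dist


def match_queries(pred_boxes, ref_points, gt_boxes, point_weight=1.0):
    """Return {gt_index: query_index} with minimal total cost.

    Brute force over assignments, so only suitable for a handful of
    queries. An empty target list (the non-target case) yields {}.
    """
    n_queries, n_targets = len(pred_boxes), len(gt_boxes)
    if n_targets == 0:
        return {}
    if n_targets > n_queries:
        raise ValueError("need at least as many queries as targets")

    best_perm, best_cost = None, inf
    for perm in permutations(range(n_queries), n_targets):
        total = sum(
            match_cost(pred_boxes[q], ref_points[q], gt_boxes[g], point_weight)
            for g, q in enumerate(perm)
        )
        if total < best_cost:
            best_cost, best_perm = total, perm
    return {g: q for g, q in enumerate(best_perm)}


# Three instance queries, two referred targets (multi-target case).
pred_boxes = [(0.1, 0.1, 0.3, 0.3), (0.6, 0.6, 0.9, 0.9), (0.4, 0.4, 0.5, 0.5)]
ref_points = [(0.2, 0.2), (0.75, 0.75), (0.45, 0.45)]
gt_boxes = [(0.6, 0.6, 0.9, 0.9), (0.1, 0.1, 0.3, 0.3)]

print(match_queries(pred_boxes, ref_points, gt_boxes))
print(match_queries(pred_boxes, ref_points, []))
```

In this toy example the first target is assigned to query 1 and the second to query 0, while the unmatched query 2 would be treated as background. Because the same assignment is shared by the point, box, and mask outputs of a query, a matching of this kind is one plausible way to keep the three granularities tied to the same instance.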