🤖 AI Summary
To address critical challenges in Generalized Referring Expression Comprehension (GREC)—including the scarcity of dialogue-grounded data, the difficulty of resolving coreference over long contexts, and severe train-test distribution shift—this paper proposes a three-tier controllable data synthesis framework. The framework systematically generates large-scale, diverse, and finely annotated dialogue-based referring data while preserving visual-linguistic fidelity. The method integrates vision-language alignment modeling with explicit coreference resolution mechanisms to enable robust referring expression localization across scenes and dialogue turns. Evaluated on standard benchmarks, the approach significantly outperforms existing state-of-the-art methods, achieving an 8.2% improvement in mean Average Precision (mAP). Notably, it demonstrates superior generalization in long-dialogue and out-of-distribution scenarios. This work establishes a scalable paradigm for both data curation and modeling in dialogue-driven visual referring understanding.
📝 Abstract
Dialogue-Based Generalized Referring Expression Comprehension (GREC) requires models to ground referring expressions to an arbitrary number of targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.