🤖 AI Summary
To address critical challenges in Generalized Referring Expression Comprehension (GREC)—including the scarcity of dialogue-grounded data, the difficulty of resolving coreference over long contexts, and severe train-test distribution shift—this paper proposes a three-tier controllable data synthesis framework. The framework systematically generates large-scale, diverse, and finely annotated dialogue-based referring data while preserving visual-linguistic fidelity. The method integrates vision-language alignment modeling with explicit coreference resolution mechanisms to enable robust referring expression localization across scenes and dialogue turns. Evaluated on standard benchmarks, the approach significantly outperforms existing state-of-the-art methods, achieving an 8.2% improvement in mean Average Precision (mAP). Notably, it demonstrates superior generalization in long-dialogue and out-of-distribution scenarios. This work establishes a scalable paradigm for both data curation and modeling in dialogue-driven visual referring understanding.
📝 Abstract
Dialogue-Based Generalized Referring Expression Comprehension (GREC) requires models to ground referring expressions to an arbitrary number of targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.