Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses referential ambiguity in visually grounded dialogue caused by pronouns and ellipsis, proposing a unified framework that jointly models textual reference resolution (coreference resolution and predicate-argument structure analysis) and multimodal referring expression grounding. Methodologically, it maps textual mention embeddings into the visual object embedding space and matches mentions to objects by embedding similarity, explicitly incorporating linguistic semantic structure into visual grounding. Key contributions include: (i) empirical evidence that explicit textual referential structure improves pronoun phrase grounding and sharpens the model's confidence discrimination for ambiguous references; (ii) positive transfer from textual reference resolution to multimodal referring expression grounding. The approach outperforms MDETR and GLIP on pronoun phrase grounding, with qualitative analysis confirming its effectiveness in mitigating visual referential ambiguity.

📝 Abstract
Multimodal reference resolution, including phrase grounding, aims to understand the semantic relations between mentions and real-world objects. Phrase grounding between images and their captions is a well-established task. In contrast, for real-world applications, it is essential to integrate textual and multimodal reference resolution to unravel the reference relations within dialogue, especially in handling ambiguities caused by pronouns and ellipses. This paper presents a framework that unifies textual and multimodal reference resolution by mapping mention embeddings to object embeddings and selecting mentions or objects based on their similarity. Our experiments show that learning textual reference resolution, such as coreference resolution and predicate-argument structure analysis, positively affects performance in multimodal reference resolution. In particular, our model with coreference resolution performs better in pronoun phrase grounding than representative models for this task, MDETR and GLIP. Our qualitative analysis demonstrates that incorporating textual reference relations strengthens the confidence scores between mentions, including pronouns and predicates, and objects, which can reduce the ambiguities that arise in visually grounded dialogues.
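The core mechanism described in the abstract, mapping mention embeddings to object embeddings and selecting objects by similarity, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the cosine-similarity choice, and the temperature parameter are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def ground_mentions(mention_emb: torch.Tensor,
                    object_emb: torch.Tensor,
                    temperature: float = 0.1):
    """Score each textual mention against each visual object.

    mention_emb: (num_mentions, d) mention embeddings already mapped
                 into the object embedding space.
    object_emb:  (num_objects, d) visual object embeddings.
    Returns the index of the best-matching object per mention and the
    full mention-object probability matrix.
    """
    # L2-normalize so the dot product is cosine similarity (an assumption;
    # the paper only states that selection is similarity-based).
    m = F.normalize(mention_emb, dim=-1)
    o = F.normalize(object_emb, dim=-1)
    logits = m @ o.t() / temperature          # (num_mentions, num_objects)
    probs = logits.softmax(dim=-1)            # confidence per mention-object pair
    return probs.argmax(dim=-1), probs
```

Under this sketch, a pronoun mention that shares referential structure with an earlier noun phrase would inherit a similar embedding, pushing its probability mass toward the same object and reducing grounding ambiguity.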
Problem

Research questions and friction points this paper is trying to address.

Resolving ambiguous references in visually grounded dialogues
Integrating textual and multimodal semantic structures
Improving pronoun phrase grounding accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint modeling of textual and multimodal semantic structures
Mapping mention embeddings to object embeddings
Incorporating coreference resolution for pronoun grounding