CLUE: Crossmodal disambiguation via Language-vision Understanding with attEntion

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in interactive visual grounding where robots struggle to actively determine when to ask clarifying questions to resolve referential ambiguity. The authors propose a novel approach that explicitly converts the cross-modal attention of a vision-language model (VLM) into spatial signals to detect referential ambiguity and trigger clarification queries. By integrating a lightweight CNN with a LoRA-finetuned decoder, the method achieves parameter-efficient interactive grounding using only InViG supervision. Notably, this study is the first to leverage the internal cross-modal attention of VLMs for spatialized ambiguity detection, outperforming state-of-the-art methods in both ambiguity identification accuracy and overall grounding performance.

📝 Abstract
With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack a mechanism for deciding when to ask clarification questions, as they rely implicitly on their learned representations. CLUE addresses this gap by converting the VLM's cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text-to-image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA-finetuned decoder conducts the dialog and emits grounding location tokens. We train on a real-world interactive IVG dataset and a mixed-ambiguity set for the detector. With InViG-only supervision, our model surpasses a state-of-the-art method while using parameter-efficient fine-tuning, and the ambiguity detector likewise outperforms prior baselines. The data and code are publicly available at: mouadabrini.github.io/clue
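The abstract's core idea, turning text-to-image cross-modal attention into a spatial signal and classifying it with a lightweight CNN, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the grid size, token counts, and CNN layers are placeholder assumptions, and the attention tensor here is randomly generated in place of real VLM attention.

```python
import torch
import torch.nn as nn

class AmbiguityDetector(nn.Module):
    """Lightweight CNN that classifies a spatial attention map as
    ambiguous (multiple plausible referents) or unambiguous.
    Layer sizes are illustrative placeholders."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, 1),  # single logit: P(ambiguous)
        )

    def forward(self, attn_map: torch.Tensor) -> torch.Tensor:
        return self.net(attn_map)

# Stand-in for VLM cross-attention from T text tokens to P image
# patches, already averaged over heads -> (batch, T, P).
B, T, grid = 2, 12, 24
P = grid * grid
attn = torch.rand(B, T, P).softmax(dim=-1)

# Pool over text tokens and fold the patch axis back into a 2D grid,
# yielding one spatial "where is the model looking" map per sample.
spatial = attn.mean(dim=1).view(B, 1, grid, grid)

detector = AmbiguityDetector()
logits = detector(spatial)                       # (B, 1)
ask_clarification = torch.sigmoid(logits) > 0.5  # trigger a question?
```

A diffuse or multi-peaked map would, after training, push the detector toward "ambiguous," prompting the dialog decoder to ask a clarifying question instead of grounding immediately.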
Problem

Research questions and friction points this paper is trying to address.

Interactive Visual Grounding
Referential Ambiguity
Human-Robot Interaction
Clarification Questions
Crossmodal Disambiguation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive Visual Grounding
Cross-modal Attention
Referential Ambiguity
LoRA Fine-tuning
Visual-Language Model