Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses representational blur in dialogue systems: when shared context must persist over long interactions, purely textual representations compress fine-grained distinctions, so similar but distinct entities collapse into interchangeable descriptions. Inspired by human mental imagery, the authors propose an active visual scaffolding framework that incrementally externalizes dialogue state into a persistent visual history that can be retrieved later. This depictive history is combined with propositional textual information to form an explicit, multimodal representation of common ground. On the IndiRef benchmark, incremental externalization alone improves over full-dialogue reasoning, and visual scaffolding adds further gains by reducing representational blur and enforcing concrete scene commitments. Textual representations remain better suited to non-depictable information, and the hybrid multimodal configuration achieves the best overall performance.

📝 Abstract

Situated dialogue requires speakers to maintain a reliable representation of shared context rather than reasoning only over isolated utterances. Current conversational agents often struggle with this requirement, especially when the common ground must be preserved beyond the immediate context window. In such settings, fine-grained distinctions are frequently compressed into purely textual representations, leading to a critical failure mode we call representational blur, in which similar but distinct entities collapse into interchangeable descriptions. This semantic flattening creates an illusion of grounding, where agents appear locally coherent but fail to track shared context persistently over time. Inspired by the role of mental imagery in human reasoning, and based on the increased availability of multimodal models, we explore whether conversational agents can be given an analogous ability to construct depictive intermediate representations during dialogue to address these limitations. To this end, we introduce an active visual scaffolding framework that incrementally converts dialogue state into a persistent visual history that can later be retrieved for grounded response generation. Evaluation on the IndiRef benchmark shows that incremental externalization itself improves over full-dialogue reasoning, while visual scaffolding provides additional gains by reducing representational blur and enforcing concrete scene commitments. At the same time, textual representations remain advantageous for non-depictable information, and a hybrid multimodal setting yields the best overall performance. Together, these findings suggest that conversational agents benefit from an explicitly multimodal representation of common ground that integrates depictive and propositional information.
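
The idea of an incrementally built, retrievable common ground that mixes depictive and propositional entries can be illustrated with a small data structure. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Entry`, `CommonGround`, `update`, and `retrieve` are hypothetical, a plain string stands in for a generated image, the `depictable` flag is supplied by hand, and retrieval uses simple word overlap rather than a multimodal model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    turn: int      # dialogue turn that produced this entry
    kind: str      # "depictive" (scene snapshot) or "propositional" (text fact)
    content: str   # hypothetical stand-in for an image handle or a textual fact


@dataclass
class CommonGround:
    """Hypothetical hybrid store: one persistent history, two entry kinds."""
    history: List[Entry] = field(default_factory=list)

    def update(self, turn: int, utterance: str, depictable: bool) -> None:
        # Incremental externalization: each turn adds one persistent entry
        # instead of re-reading the whole dialogue later.
        kind = "depictive" if depictable else "propositional"
        self.history.append(Entry(turn, kind, utterance))

    def retrieve(self, query: str, k: int = 2) -> List[Entry]:
        # Assumption: word overlap replaces a real retrieval model.
        q = set(query.lower().split())
        scored = [(len(q & set(e.content.lower().split())), e) for e in self.history]
        scored = [(s, e) for s, e in scored if s > 0]
        # Highest overlap first; ties go to the more recent turn.
        scored.sort(key=lambda pair: (-pair[0], -pair[1].turn))
        return [e for _, e in scored[:k]]


cg = CommonGround()
cg.update(1, "the red mug is left of the laptop", depictable=True)
cg.update(2, "the blue mug is behind the laptop", depictable=True)
cg.update(3, "the meeting starts at noon", depictable=False)
hits = cg.retrieve("which mug is left of the laptop", k=1)
```

Keeping the two mugs as separate entries, each tied to the turn that introduced it, is the point of the sketch: the distinction that a single compressed summary might blur stays retrievable, while the non-depictable fact about the meeting remains in textual form.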

Problem

Research questions and friction points this paper is trying to address.

common ground

situated dialogue

representational blur

mental imagery

multimodal representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

mental imagery

visual scaffolding

common ground