Communicating about Space: Language-Mediated Spatial Integration Across Partial Views

📅 2026-03-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates how multimodal large language models (MLLMs) integrate localized spatial observations from distinct viewpoints through natural language dialogue to construct a coherent global spatial understanding. To this end, the authors introduce the COSMIC benchmark, which tasks two static MLLM agents with collaboratively answering spatial questions about 3D indoor scenes through dialogue, starting from different viewpoints. The work presents the first systematic evaluation of MLLMs on core capabilities including landmark identification, relational reasoning, and global map construction. Experimental results show that the state-of-the-art model Gemini-3-Pro-Thinking achieves an overall accuracy of 72%, substantially below human performance at 95%; while it performs reasonably well in landmark identification, its ability to build consistent global maps is nearly random. By incorporating human dialogues as a reference benchmark, this work reveals fundamental limitations of current MLLMs in collaborative spatial communication.
📝 Abstract
Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1,250 question-answer pairs spanning five tasks. We find a consistent capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, performing near chance even for frontier models. Moreover, we find that thinking capability yields consistent gains in anchor grounding but is insufficient for higher-level spatial communication. To contextualize model behavior, we additionally collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, leaving significant room for improvement over even the best-performing model, Gemini-3-Pro-Thinking, which achieves 72% aggregate accuracy. Moreover, human conversations become increasingly specific as partners converge on a shared mental model, whereas model dialogues continue to explore new possibilities rather than converging, consistent with a limited ability to build and maintain a robust shared mental model. Our code and data are available at https://github.com/ankursikarwar/Cosmic
Problem

Research questions and friction points this paper is trying to address.

spatial communication
multimodal large language models
shared mental model
egocentric views
spatial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Spatial Communication
COSMIC Benchmark
Egocentric-to-Allocentric Integration
Collaborative Dialogue
Ankur Sikarwar
Mila – Quebec AI Institute
Debangan Mishra
IIIT Hyderabad
Sudarshan Nikhil
IIIT Hyderabad
Ponnurangam Kumaraguru
IIIT Hyderabad
Aishwarya Agrawal
University of Montreal, Mila, Google DeepMind
Artificial Intelligence · Multimodal Vision-Language · Computer Vision · NLP · Deep Learning