LVLMs and Humans Ground Differently in Referential Communication

๐Ÿ“… 2026-01-27
๐Ÿ“ˆ Citations: 1
โœจ Influential: 1
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This study addresses the limitations of large vision-language models (LVLMs) in effectively modeling shared grounding during collaborative tasks, which constrains their ability to understand and generate referring expressions. Through a director-matcher paradigm, the work systematically compares multi-turn referential communication performance across four pairing typesโ€”human-human, human-AI, AI-human, and AI-AIโ€”in a label-free image matching task. Analyzing 356 dialogues from 89 participant pairs using a factorial design, an interactive online platform, and specialized tools for referring expression analysis, the study reveals that LVLMs are significantly weaker than humans in dynamically establishing common ground, highlighting deficiencies in their language grounding capabilities. The project publicly releases the full experimental pipeline and dialogue corpus, providing a benchmark resource for future research.

๐Ÿ“ Abstract
For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact over multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We release the online pipeline for data collection, the tools and analyses for accuracy, efficiency, and lexical overlap, and a corpus of 356 dialogues (89 pairs over 4 rounds each) that unmasks LVLMs' limitations in interactively resolving referring expressions, a crucial skill that underlies human language use.
Problem

Research questions and friction points this paper is trying to address.

referential communication
common ground
large vision-language models
human-AI collaboration
referring expressions
Innovation

Methods, ideas, or system contributions that make the work stand out.

referential communication
large vision-language models
common ground
interactive dialogue
lexical overlap
๐Ÿ”Ž Similar Papers
No similar papers found.
Peter Zeng
Stony Brook University
Weiling Li
Department of Psychology, Stony Brook University
Amie Paige
Department of Psychology, Stony Brook University
Zhengxiang Wang
Department of Linguistics, Stony Brook University; Institute for Advanced Computational Science, Stony Brook University
Panagiotis Kaliosis
Department of Computer Science, Stony Brook University
Dimitris Samaras
Stony Brook University
Computer Vision, Machine Learning, Computer Graphics, Medical Imaging
Gregory Zelinsky
Professor of Psychology and Computer Science, Stony Brook University
visual attention, visual search, object detection
Susan E. Brennan
Department of Psychology, Stony Brook University
Owen Rambow
Stony Brook University
Natural Language Processing, Computational Linguistics, Computational Social Science