🤖 AI Summary
This study addresses a key limitation of large vision-language models (LVLMs): their weakness in modeling shared grounding during collaborative tasks, which constrains their ability to understand and generate referring expressions. Using a director-matcher paradigm, the work systematically compares multi-turn referential communication across four pairing types (human-human, human-AI, AI-human, and AI-AI) in a label-free image-matching task. Analyzing 356 dialogues from 89 pairs with a factorial design, an interactive online platform, and dedicated tools for referring-expression analysis, the study finds that LVLMs are significantly weaker than humans at dynamically establishing common ground, exposing deficiencies in their language-grounding capabilities. The project publicly releases the full experimental pipeline and dialogue corpus, providing a benchmark resource for future research.
📝 Abstract
For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is essential. Yet this ability to collaborate remains limited by a critical deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact over multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We release the online pipeline for data collection, the tools and analyses for accuracy, efficiency, and lexical overlap, and a corpus of 356 dialogues (89 pairs over 4 rounds each) that unmasks LVLMs' limitations in interactively resolving referring expressions, a crucial skill underlying human language use.
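To make the lexical-overlap analysis concrete, here is a minimal sketch of one way to score how much a director's referring expressions in later rounds reuse the wording of round 1. This is not the authors' released tooling: the record layout, tokenizer, stopword list, and Jaccard-based overlap definition are all assumptions for illustration.

```python
# Hypothetical sketch: round-over-round lexical overlap of director utterances.
# Record layout, tokenization, and the overlap metric are assumptions, not the
# paper's released analysis code.
from collections import defaultdict
from statistics import mean

STOPWORDS = {"the", "a", "an", "it", "is", "that", "this", "one", "like", "of"}

def content_tokens(utterance: str) -> set[str]:
    """Lowercase, strip punctuation, and drop a small stopword list."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in utterance.lower())
    return {tok for tok in cleaned.split() if tok not in STOPWORDS}

def jaccard(a: set[str], b: set[str]) -> float:
    """Set-overlap similarity; 0.0 when both sets are empty."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def round_overlap(turns: list[dict]) -> dict[int, float]:
    """Mean lexical overlap between each later round's referring expressions
    and round 1, computed per item and averaged across items.

    Each turn dict is assumed to look like:
      {"item": "obj_01", "round": 2, "director_text": "the dancing bird one"}
    """
    per_item = defaultdict(lambda: defaultdict(set))  # item -> round -> token set
    for t in turns:
        per_item[t["item"]][t["round"]] |= content_tokens(t["director_text"])

    scores = defaultdict(list)  # round -> per-item overlaps with round 1
    for rounds in per_item.values():
        base = rounds.get(1, set())
        for r, toks in rounds.items():
            if r != 1:
                scores[r].append(jaccard(base, toks))
    return {r: mean(vals) for r, vals in sorted(scores.items()) if vals}

if __name__ == "__main__":
    # Illustrative dialogue fragment (made-up item and utterances).
    demo = [
        {"item": "obj_01", "round": 1, "director_text": "the one that looks like a dancing bird"},
        {"item": "obj_01", "round": 2, "director_text": "dancing bird again"},
        {"item": "obj_01", "round": 3, "director_text": "the bird"},
    ]
    print(round_overlap(demo))  # e.g. {2: 0.5, 3: 0.33}
```

A rising overlap score across rounds would indicate the kind of lexical entrainment that typically accompanies successful common-ground formation in human-human pairs; the released corpus and tools should be consulted for the metric actually used in the paper.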