🤖 AI Summary
This study addresses a key limitation of large vision-language models (LVLMs): their weakness in modeling shared grounding during collaborative tasks, which constrains their ability to understand and generate referring expressions. Using a director-matcher paradigm, the work systematically compares multi-turn referential communication across four pairing types (human-human, human-AI, AI-human, and AI-AI) in a label-free image-matching task. Analyzing 356 dialogues from 89 pairs with a factorial design, an interactive online platform, and dedicated tools for referring-expression analysis, the study finds that LVLMs are significantly weaker than humans at dynamically establishing common ground, exposing deficiencies in their language-grounding capabilities. The project publicly releases the full experimental pipeline and dialogue corpus, providing a benchmark resource for future research.
📝 Abstract
For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is essential. Yet this ability to collaborate remains limited by a critical deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact over multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We release the online pipeline for data collection, the tools and analyses for accuracy, efficiency, and lexical overlap, and a corpus of 356 dialogues (89 pairs over 4 rounds each) that unmasks LVLMs' limitations in interactively resolving referring expressions, a crucial skill underlying human language use.
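To make the lexical-overlap analysis concrete, here is a minimal sketch of one way to score how much a director's referring expressions in later rounds reuse the wording of round 1. This is not the authors' released tooling: the record layout, tokenizer, stopword list, and Jaccard-based overlap definition are all assumptions for illustration.

```python
# Hypothetical sketch: round-over-round lexical overlap of director utterances.
# Record layout, tokenization, and the overlap metric are assumptions, not the
# paper's released analysis code.
from collections import defaultdict
from statistics import mean

STOPWORDS = {"the", "a", "an", "it", "is", "that", "this", "one", "like", "of"}

def content_tokens(utterance: str) -> set[str]:
    """Lowercase, strip punctuation, and drop a small stopword list."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in utterance.lower())
    return {tok for tok in cleaned.split() if tok not in STOPWORDS}

def jaccard(a: set[str], b: set[str]) -> float:
    """Set-overlap similarity; 0.0 when both sets are empty."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def round_overlap(turns: list[dict]) -> dict[int, float]:
    """Mean lexical overlap between each later round's referring expressions
    and round 1, computed per item and averaged across items.

    Each turn dict is assumed to look like:
      {"item": "obj_01", "round": 2, "director_text": "the dancing bird one"}
    """
    per_item = defaultdict(lambda: defaultdict(set))  # item -> round -> token set
    for t in turns:
        per_item[t["item"]][t["round"]] |= content_tokens(t["director_text"])

    scores = defaultdict(list)  # round -> per-item overlaps with round 1
    for rounds in per_item.values():
        base = rounds.get(1, set())
        for r, toks in rounds.items():
            if r != 1:
                scores[r].append(jaccard(base, toks))
    return {r: mean(vals) for r, vals in sorted(scores.items()) if vals}

if __name__ == "__main__":
    # Illustrative dialogue fragment (made-up item and utterances).
    demo = [
        {"item": "obj_01", "round": 1, "director_text": "the one that looks like a dancing bird"},
        {"item": "obj_01", "round": 2, "director_text": "dancing bird again"},
        {"item": "obj_01", "round": 3, "director_text": "the bird"},
    ]
    print(round_overlap(demo))  # e.g. {2: 0.5, 3: 0.33}
```

A rising overlap score across rounds would indicate the kind of lexical entrainment that typically accompanies successful common-ground formation in human-human pairs; the released corpus and tools should be consulted for the metric actually used in the paper.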