🤖 AI Summary
This study investigates the capacity of vision-language models (VLMs) to perform context-sensitive pragmatic reasoning in multi-turn dialogues, using iterative reference games as a canonical pragmatic task. Methodologically, we systematically manipulate contextual factors—including quantity, ordering, and relevance—and conduct few-shot evaluations against human baselines. Results show that relevant contextual information substantially improves VLM performance, enabling pragmatic inference accuracy approaching human-level competence; however, models remain significantly deficient when relevant context is absent or when resolving abstract references. This work provides the first empirical evidence that contextual relevance is a decisive factor in VLM pragmatic understanding, identifies dynamic context modeling as a critical bottleneck in current architectures, and establishes a rigorous evaluation framework to assess multimodal pragmatic reasoning. The findings offer concrete empirical grounding for developing more human-consistent, context-aware VLMs.
📝 Abstract
Iterated reference games - in which players repeatedly pick out novel referents using language - present a test case for agents' ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments. We tested humans and vision-language models on trials from iterated reference games, varying the given context in terms of amount, order, and relevance. Without relevant context, models performed above chance but substantially worse than humans. With relevant context, however, model performance increased dramatically over trials. Few-shot reference games with abstract referents remain a difficult task for machine learning models.