🤖 AI Summary
Existing vision-language model (VLM) evaluations focus predominantly on single-turn question answering, neglecting the incremental co-construction of shared understanding inherent in natural human–AI interaction. Method: We propose the first multidimensional evaluation framework tailored to interactive settings, assessing grounding efficiency, content alignment, lexical adaptation, and human-likeness. Our methodology integrates interactive referring expression games, self-play conversational analysis, and human–machine comparative experiments, systematically quantifying model behavior across 150 self-play dialogues. Contribution/Results: All evaluated VLMs exhibit significant divergence from human interaction patterns; while GPT-4o-mini performs closest to humans, it still shows substantial deficits along critical dimensions. Our findings expose a fundamental decoupling between task success and semantic alignment, showing that high accuracy on static benchmarks does not imply robust interactive competence. This work establishes a new paradigm for evaluating VLMs' interactive capabilities beyond conventional task completion or static alignment metrics.
📝 Abstract
Large vision-language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question-answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT-4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.
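The abstract names the four metrics without defining them. As a rough illustration only, the sketch below shows one plausible way to score lexical adaptation in a single self-play session as vocabulary overlap between the two interlocutors; the tokenization, Jaccard formula, and function names are assumptions made for this sketch, not the paper's definitions.

```python
# Illustrative sketch only: one plausible way to quantify "lexical adaptation"
# for a two-agent referential game session. The paper's actual metric
# definitions are not given in the abstract, so the names and formula below
# are assumptions made for illustration.
from typing import List, Tuple


def tokenize(utterance: str) -> List[str]:
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return utterance.lower().split()


def lexical_adaptation(dialogue: List[Tuple[str, str]]) -> float:
    """Jaccard overlap between the two speakers' vocabularies.

    `dialogue` is a list of (speaker, utterance) pairs from one self-play
    session; speakers are labelled "A" and "B". Higher overlap suggests the
    agents converge on shared referring expressions as the game progresses.
    """
    vocab = {"A": set(), "B": set()}
    for speaker, utterance in dialogue:
        vocab[speaker].update(tokenize(utterance))
    union = vocab["A"] | vocab["B"]
    if not union:
        return 0.0
    return len(vocab["A"] & vocab["B"]) / len(union)


if __name__ == "__main__":
    session = [
        ("A", "the picture with the red kite over the beach"),
        ("B", "the red kite one, got it"),
        ("A", "yes, the kite photo"),
    ]
    print(f"lexical adaptation: {lexical_adaptation(session):.2f}")
```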