🤖 AI Summary
Existing vision-language model (VLM) evaluations focus predominantly on single-turn question answering, neglecting the incremental co-construction of shared understanding inherent in natural human–AI interaction. Method: We propose the first multidimensional evaluation framework tailored to interactive settings, assessing grounding efficiency, content alignment, lexical adaptation, and human-likeness. Our methodology integrates interactive referring expression games, self-play conversational analysis, and human–machine comparative experiments, systematically quantifying model behavior across 150 self-play dialogues. Contribution/Results: All evaluated VLMs exhibit significant divergence from human interaction patterns; while GPT-4o-mini performs closest to humans, it still shows substantial deficits along critical dimensions. Our findings expose a fundamental decoupling between task success and semantic alignment, showing that high accuracy on static benchmarks does not imply robust interactive competence. This work establishes a new paradigm for evaluating VLMs' interactive capabilities beyond conventional task completion or static alignment metrics.
📝 Abstract
Large vision-language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question-answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT-4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.
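The abstract names the four metrics without defining them. As a rough illustration only, the sketch below shows one plausible way to score lexical adaptation in a single self-play session as vocabulary overlap between the two interlocutors; the tokenization, Jaccard formula, and function names are assumptions made for this sketch, not the paper's definitions.

```python
# Illustrative sketch only: one plausible way to quantify "lexical adaptation"
# for a two-agent referential game session. The paper's actual metric
# definitions are not given in the abstract, so the names and formula below
# are assumptions made for illustration.
from typing import List, Tuple


def tokenize(utterance: str) -> List[str]:
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return utterance.lower().split()


def lexical_adaptation(dialogue: List[Tuple[str, str]]) -> float:
    """Jaccard overlap between the two speakers' vocabularies.

    `dialogue` is a list of (speaker, utterance) pairs from one self-play
    session; speakers are labelled "A" and "B". Higher overlap suggests the
    agents converge on shared referring expressions as the game progresses.
    """
    vocab = {"A": set(), "B": set()}
    for speaker, utterance in dialogue:
        vocab[speaker].update(tokenize(utterance))
    union = vocab["A"] | vocab["B"]
    if not union:
        return 0.0
    return len(vocab["A"] & vocab["B"]) / len(union)


if __name__ == "__main__":
    session = [
        ("A", "the picture with the red kite over the beach"),
        ("B", "the red kite one, got it"),
        ("A", "yes, the kite photo"),
    ]
    print(f"lexical adaptation: {lexical_adaptation(session):.2f}")
```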