🤖 AI Summary
This work identifies systematic pragmatic deficiencies in vision-language models (VLMs) on referring expression generation (REG): violations of Grice's Cooperative Principle that surface as non-unique references, redundant or irrelevant descriptions, and divergence from human pragmatic preferences, such as underusing the minimal spatial cues humans favor. Methodologically, the authors (1) characterize, through a pragmatic lens, three core failure modes of VLMs in REG; (2) introduce RefOI, a new dataset of 1.5k images annotated with both written and spoken referring expressions and fine-grained pragmatic labels; and (3) propose a Grice-informed evaluation framework that combines human judgment with granular error analysis. Experiments across prominent VLMs, including BLIP-2, LLaVA, and Qwen-VL, reveal widespread failures of referential uniqueness, conciseness, and alignment with human spatial-cue preferences. Crucially, standard automatic metrics (e.g., CIDEr, SPICE) correlate poorly with human pragmatic judgments.
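To make the uniqueness criterion (failure mode 1) concrete, here is a minimal sketch of a listener-based check, a common proxy in REG evaluation. It is illustrative only, not the paper's actual protocol, which relies on human judgment; the function name `resolves_uniquely`, the candidate-box interface, and the use of an off-the-shelf CLIP checkpoint as the "listener" are all assumptions of this sketch.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical sketch: a CLIP-based "listener" tries to resolve a generated
# referring expression to one of several candidate objects in the image.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def resolves_uniquely(expression: str, image: Image.Image,
                      candidate_boxes: list[tuple], target_idx: int) -> bool:
    # Crop every candidate object; the expression should single out one of them.
    crops = [image.crop(box) for box in candidate_boxes]
    inputs = processor(text=[expression], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image: (num_crops, 1) similarity of each crop to the text
        logits = model(**inputs).logits_per_image.squeeze(-1)
    # Referential success: the listener picks the intended target.
    return logits.argmax().item() == target_idx
```

Note that such a listener only tests whether the expression resolves; it does not penalize expressions that resolve correctly while carrying redundant attributes (failure mode 2), which is one reason the paper argues for a dedicated pragmatic framework.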
📝 Abstract
Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.
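To see why surface-overlap metrics can miss these pragmatic violations, consider a self-contained toy illustration. This is not the paper's metrics pipeline, and CIDEr and SPICE are more elaborate than token F1, but they share the underlying reliance on overlap with reference strings; the example expressions below are hypothetical.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Token-level F1: a stand-in for the surface overlap n-gram metrics reward."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the white ceramic mug on the left of the table"  # hypothetical human reference
concise = "the left mug"                                      # uniquely identifies the referent
verbose = "the white ceramic mug on the left side of the wooden table"

print(unigram_f1(concise, reference))  # ~0.46: pragmatically adequate, low overlap
print(unigram_f1(verbose, reference))  # ~0.91: redundant, but rewarded for overlap
```

The overlap metric prefers the verbose expression even though a cooperative speaker needs only "the left mug" when one mug is on the left, which is exactly the kind of misalignment with genuine referential success described above.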