Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies systematic pragmatic deficiencies in vision-language models (VLMs) on referring expression generation (REG): violations of Grice's Cooperative Principle produce non-unique references, redundant descriptions, and broad divergence from human pragmatic preferences, including underuse of the minimal spatial cues humans favor. Methodologically, the paper (1) characterizes, through a pragmatic lens, three core failure modes of VLMs in REG; (2) introduces RefOI, a dataset of 1.5k images annotated with both written and spoken referring expressions and fine-grained pragmatic labels; and (3) proposes a Grice-informed pragmatic evaluation framework integrating human judgment with granular error analysis. Experiments across prominent VLMs, including BLIP-2, LLaVA, and Qwen-VL, reveal widespread violations of referential uniqueness and informational conciseness, along with misalignment with human preferences for minimal spatial cues. Crucially, standard automatic metrics (e.g., CIDEr, SPICE) diverge strongly from human pragmatic judgments.

📝 Abstract
Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.
Problem

Research questions and friction points this paper is trying to address.

VLMs fail to uniquely identify referents in REG tasks
VLMs include excessive irrelevant information in expressions
VLMs misalign with human pragmatic communication preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces RefOI dataset with 1.5k images
Evaluates VLMs for pragmatic competence failures
Advocates pragmatically informed models and evaluations
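The paper's claim that automatic metrics misalign with human pragmatic judgments can be illustrated with a simple pairwise agreement check: how often does a metric rank a humanly-successful expression above a failed one? The function and data below are a hypothetical sketch for illustration only; the paper's actual scores come from RefOI annotations and human listener studies.

```python
# Hypothetical sketch: pairwise agreement between an automatic metric's
# scores and binary human judgments of referential success.
# 0.5 = chance-level agreement, 1.0 = perfect agreement.

def agreement_rate(metric_scores, human_success):
    """For every (successful, failed) pair of expressions, count how often
    the metric scores the humanly-successful one higher (ties count half)."""
    wins = ties = total = 0
    for s, ok_s in zip(metric_scores, human_success):
        for f, ok_f in zip(metric_scores, human_success):
            if ok_s and not ok_f:
                total += 1
                if s > f:
                    wins += 1
                elif s == f:
                    ties += 1
    return (wins + 0.5 * ties) / total if total else float("nan")

# Toy example: a surface-overlap metric rewards verbose, non-unique
# expressions even when human listeners fail to locate the referent.
scores  = [0.9, 0.8, 0.3, 0.7]        # automatic metric (e.g., CIDEr-like)
success = [False, True, True, False]  # did a human identify the referent?
print(round(agreement_rate(scores, success), 2))  # → 0.25, well below chance
```

An agreement rate below 0.5, as in this toy case, is the misalignment pattern the paper reports: the metric systematically prefers expressions that fail pragmatically.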