🤖 AI Summary
This work identifies systematic pragmatic deficiencies in vision-language models (VLMs) on referring expression generation (REG): violations of Grice's Cooperative Principle that surface as non-unique references, redundant or irrelevant descriptions, and divergence from human pragmatic preferences, such as underusing the minimal spatial cues humans favor. Methodologically, the authors (1) characterize, through a pragmatic lens, three core failure modes of VLMs in REG; (2) introduce RefOI, a new dataset of 1.5k images annotated with both written and spoken referring expressions and fine-grained pragmatic labels; and (3) propose a Grice-informed evaluation framework that combines human judgment with granular error analysis. Experiments across prominent VLMs, including BLIP-2, LLaVA, and Qwen-VL, reveal widespread failures of referential uniqueness, conciseness, and alignment with human spatial-cue preferences. Crucially, standard automatic metrics (e.g., CIDEr, SPICE) correlate poorly with human pragmatic judgments.
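To make the uniqueness criterion (failure mode 1) concrete, here is a minimal sketch of a listener-based check, a common proxy in REG evaluation. It is illustrative only, not the paper's actual protocol, which relies on human judgment; the function name `resolves_uniquely`, the candidate-box interface, and the use of an off-the-shelf CLIP checkpoint as the "listener" are all assumptions of this sketch.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical sketch: a CLIP-based "listener" tries to resolve a generated
# referring expression to one of several candidate objects in the image.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def resolves_uniquely(expression: str, image: Image.Image,
                      candidate_boxes: list[tuple], target_idx: int) -> bool:
    # Crop every candidate object; the expression should single out one of them.
    crops = [image.crop(box) for box in candidate_boxes]
    inputs = processor(text=[expression], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image: (num_crops, 1) similarity of each crop to the text
        logits = model(**inputs).logits_per_image.squeeze(-1)
    # Referential success: the listener picks the intended target.
    return logits.argmax().item() == target_idx
```

Note that such a listener only tests whether the expression resolves; it does not penalize expressions that resolve correctly while carrying redundant attributes (failure mode 2), which is one reason the paper argues for a dedicated pragmatic framework.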
📝 Abstract
Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.
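To see why surface-overlap metrics can miss these pragmatic violations, consider a self-contained toy illustration. This is not the paper's metrics pipeline, and CIDEr and SPICE are more elaborate than token F1, but they share the underlying reliance on overlap with reference strings; the example expressions below are hypothetical.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Token-level F1: a stand-in for the surface overlap n-gram metrics reward."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the white ceramic mug on the left of the table"  # hypothetical human reference
concise = "the left mug"                                      # uniquely identifies the referent
verbose = "the white ceramic mug on the left side of the wooden table"

print(unigram_f1(concise, reference))  # ~0.46: pragmatically adequate, low overlap
print(unigram_f1(verbose, reference))  # ~0.91: redundant, but rewarded for overlap
```

The overlap metric prefers the verbose expression even though a cooperative speaker needs only "the left mug" when one mug is on the left, which is exactly the kind of misalignment with genuine referential success described above.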