Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery

📅 2026-05-26

🤖 AI Summary

This study investigates whether visual input genuinely improves vision-language models’ judgments of word concreteness and imagery, with particular attention to how relevant the visual evidence is to each word. Using human concreteness and imagery ratings as the reference, the authors apply probing, canonical correlation analysis (CCA), and an attribution case study to assess how model representations shift under real-image contexts and how sensitive they are to spurious visual cues. Real-image contexts do not yield consistent gains and often hurt alignment with human ratings, most sharply for words where visual evidence is least relevant; they are associated with representational shifts and weaker recoverability of the target properties. Instructing the model at inference time to focus solely on the textual content can reduce this degradation, with the clearest gains on these vulnerable subsets.
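The idea of "recoverability" under probing can be illustrated with a small linear probe. The sketch below is not the authors' setup: it assumes a ridge-regression probe, and every array in it is synthetic (the hypothetical `reps_text` and `reps_image` stand in for model representations of the same words without and with an image context, and the added noise is constructed by hand to mimic drift). It only shows how a held-out R² could be compared across the two conditions.

```python
import numpy as np

def fit_ridge_probe(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def probe_recoverability(X, y, n_train):
    """Train a probe on the first n_train rows, score it on the rest."""
    X_tr, X_te = X[:n_train], X[n_train:]
    y_tr, y_te = y[:n_train], y[n_train:]
    mu_x, mu_y = X_tr.mean(axis=0), y_tr.mean()  # center with train statistics
    w = fit_ridge_probe(X_tr - mu_x, y_tr - mu_y)
    return r_squared(y_te, (X_te - mu_x) @ w + mu_y)

# Synthetic stand-ins (assumptions, not data from the paper).
rng = np.random.default_rng(0)
n_words, dim = 400, 16
ratings = rng.normal(size=n_words)            # placeholder for human ratings
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)        # axis that encodes the rating

# Text-only representations encode the rating cleanly along `direction`.
reps_text = np.outer(ratings, direction) + 0.1 * rng.normal(size=(n_words, dim))
# Image-context representations: same signal plus hand-made drift.
reps_image = reps_text + 1.0 * rng.normal(size=(n_words, dim))

r2_text = probe_recoverability(reps_text, ratings, n_train=300)
r2_image = probe_recoverability(reps_image, ratings, n_train=300)
print(f"text-only R^2: {r2_text:.3f}, image-context R^2: {r2_image:.3f}")
```

Because the drift is injected by construction, the lower image-context score here demonstrates the measurement only, not the finding itself.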

📝 Abstract

Visual inputs are often assumed to improve language understanding in multimodal models. We examine this assumption by asking whether vision-language models (VLMs) can distinguish useful visual evidence from incidental image context in lexical judgments. We use human concreteness and imagery ratings because they span words with varying expected visual relevance, from abstract and low-imagery words to concrete and high-imagery words. We find that real-image contexts do not yield consistent gains and often hurt alignment with human ratings, most sharply when visual evidence is least relevant. Through probing and canonical correlation analysis, complemented by an attribution case study, we find that real-image contexts are associated with representational shifts and greater sensitivity to spurious visual cues, coinciding with weaker recoverability of the targeted lexical properties. We further show that instructing models to focus solely on textual content at inference time can reduce this degradation, with the clearest gains on these vulnerable subsets. Our findings suggest that current instruction-tuned VLMs need better calibration of when visual context should inform lexical judgments.
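One way to make "alignment with human ratings" concrete is a rank correlation between model scores and human ratings, computed overall and on the low-rating subset where visual evidence is expected to matter least. The sketch below assumes Spearman correlation as the alignment measure, which the abstract does not specify, and uses made-up toy numbers; `human`, `text_only`, and `with_image` are hypothetical names.

```python
import numpy as np

def rank(values):
    """1-based ranks; tied values share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

# Toy values on a 1-5 scale (illustrative only, not results from the paper).
human = np.array([1.2, 1.8, 2.5, 3.1, 4.0, 4.6, 4.9])       # human ratings
text_only = np.array([1.0, 2.0, 2.4, 3.0, 4.2, 4.5, 5.0])   # model, word only
with_image = np.array([3.5, 2.9, 3.0, 3.2, 4.1, 4.4, 4.8])  # model, word + image

rho_text = spearman(human, text_only)
rho_image = spearman(human, with_image)

# Restrict to low-rated (more abstract) words, the subset the abstract
# describes as most vulnerable. The threshold of 3.0 is an assumption.
low = human < 3.0
rho_text_low = spearman(human[low], text_only[low])
rho_image_low = spearman(human[low], with_image[low])

print(f"all words: text-only {rho_text:.2f}, with image {rho_image:.2f}")
print(f"low-rated: text-only {rho_text_low:.2f}, with image {rho_image_low:.2f}")
```

Reporting the correlation per subset, rather than only over all words, is what would expose a degradation concentrated on abstract or low-imagery words.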

Problem

Research questions and friction points this paper is trying to address.

vision-language models

concreteness

imagery

lexical judgments

visual context
