Grounded Concreteness: Human-Like Concreteness Sensitivity in Vision-Language Models

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether vision–language models exhibit greater alignment with human sensitivity to lexical concreteness than text-only large language models, even when prompted solely with text. Leveraging architecturally matched Llama and Llama Vision models, the authors systematically analyze the impact of multimodal pretraining through three lenses: output behavior, representational geometry, and attention dynamics. The work shows, for the first time, that vision–language models exhibit a stronger perceptual grounding effect despite receiving no visual input at inference. Specifically, their internal representations organize along a distinct concreteness axis, their generated concreteness ratings align more closely with human judgments, and their attention patterns show greater context independence than those of their text-only counterparts.
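
As a concrete illustration of the representational-geometry analysis, the sketch below fits a linear probe from word representations to human concreteness norms. This is a minimal sketch under stated assumptions, not the paper's implementation: `hidden_states` (layer activations for a word list) and `human_norms` (matched human ratings, e.g. on a 1–5 scale) are hypothetical inputs the caller must supply.

```python
# Minimal sketch: testing for a linear "concreteness axis" in embeddings.
# Assumed inputs (not from the paper's code):
#   hidden_states: (n_words, d) array of word representations from one layer
#   human_norms:   (n_words,) array of human concreteness ratings
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def concreteness_axis_score(hidden_states: np.ndarray,
                            human_norms: np.ndarray,
                            alpha: float = 1.0) -> float:
    """Cross-validated R^2 of a ridge probe predicting concreteness from
    embeddings; a high score suggests a roughly linear concreteness axis."""
    probe = Ridge(alpha=alpha)
    scores = cross_val_score(probe, hidden_states, human_norms,
                             cv=5, scoring="r2")
    return float(scores.mean())

# Fitting the probe on the full word list (probe.fit(hidden_states,
# human_norms)) yields probe.coef_, which can be read as the candidate
# concreteness direction; comparing scores between matched text-only and
# vision-language backbones mirrors the paper's geometry comparison.
```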

📝 Abstract
Do vision–language models (VLMs) develop more human-like sensitivity to linguistic concreteness than text-only large language models (LLMs) when both are evaluated with text-only prompts? We study this question with a controlled comparison between matched Llama text backbones and their Llama Vision counterparts across multiple model scales, treating multimodal pretraining as an ablation on perceptual grounding rather than as access to images at inference. We measure concreteness effects at three complementary levels: (i) output behavior, by relating question-level concreteness to QA accuracy; (ii) embedding geometry, by testing whether representations organize along a concreteness axis; and (iii) attention dynamics, by quantifying context reliance via attention-entropy measures. In addition, we elicit token-level concreteness ratings from models and evaluate their alignment to human norm distributions, testing whether multimodal training yields more human-consistent judgments. Across benchmarks and scales, VLMs show larger gains on more concrete inputs, exhibit more clearly concreteness-structured representations, produce ratings that better match human norms, and display systematically different attention patterns consistent with increased grounding.
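
One way to realize the attention-entropy measure of context reliance mentioned in the abstract is sketched below. The aggregation (a flat average over layers, heads, and query positions) is an illustrative assumption, not necessarily the paper's exact metric, and the input format assumes attention weights already extracted as NumPy arrays.

```python
# Minimal sketch of an attention-entropy measure of context reliance.
# Assumes `attentions` is a list with one (n_heads, seq_len, seq_len)
# array per layer, rows normalized to sum to 1 (e.g. Hugging Face
# outputs with output_attentions=True, converted to NumPy and with the
# batch dimension dropped).
import numpy as np

def mean_attention_entropy(attentions, eps: float = 1e-12) -> float:
    """Average Shannon entropy (nats) over all layers, heads, and query
    positions. Lower entropy = attention concentrated on fewer tokens,
    i.e. more context-independent processing."""
    per_layer = []
    for layer_attn in attentions:
        p = np.clip(layer_attn, eps, 1.0)
        h = -(p * np.log(p)).sum(axis=-1)  # entropy of each attention row
        per_layer.append(h.mean())
    return float(np.mean(per_layer))
```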
Problem

Research questions and friction points this paper is trying to address.

concreteness
vision-language models
large language models
perceptual grounding
human-like sensitivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

concreteness
vision-language models
perceptual grounding
embedding geometry
attention dynamics
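
For the rating-elicitation analysis described in the abstract, a simple alignment score between model-generated and human concreteness ratings could look like the sketch below. Note the simplification: the abstract compares against human norm distributions, whereas this hypothetical helper only rank-correlates against per-word norm means.

```python
# Minimal sketch: alignment of elicited concreteness ratings with human
# norms. `model_ratings` and `human_means` are hypothetical 1-D arrays of
# per-word ratings on a shared scale (e.g. 1-5).
from scipy.stats import spearmanr

def norm_alignment(model_ratings, human_means) -> float:
    """Spearman rank correlation between model and human ratings;
    higher values indicate more human-consistent judgments."""
    rho, _ = spearmanr(model_ratings, human_means)
    return float(rho)
```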