🤖 AI Summary
Large vision-language models (LVLMs) frequently over-rely on linguistic priors (LP), under-utilizing visual evidence; existing input-output analyses cannot pinpoint when and how visual information influences internal generation. Method: We give the first formal definition and localization of “Visual Integration Points” (VIPs), the layers at which visual features substantively integrate into linguistic representations, and propose a Total Visual Integration (TVI) estimator that quantifies LP strength by combining inter-layer representation dynamics, chain-wise embedding contrast, and distance aggregation for fine-grained mechanistic probing. Contribution/Results: Across 54 model-dataset combinations (9 LVLMs, 6 benchmarks), VIPs consistently emerge, and TVI strongly predicts LP bias (r = −0.82). The framework establishes a novel, interpretable assessment paradigm for trustworthy multimodal reasoning in LVLMs.
📝 Abstract
Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP), the memorized textual patterns from pre-training, while under-utilizing visual evidence. Prior analyses of the LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of the language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly the visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that the VIP consistently emerges and that TVI reliably predicts the strength of the language prior. Together, these results offer a principled toolkit for diagnosing and understanding the language prior in LVLMs.
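To make the mechanism in the abstract concrete, here is a minimal sketch of how a VIP and TVI might be computed from two chains of embeddings: one layer-wise trace obtained with the image present, one with the image ablated. The cosine-distance metric, the threshold-based VIP detection rule, and the names `find_vip` and `total_visual_integration` are illustrative assumptions; the abstract states only that TVI aggregates representation distance beyond the VIP.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity between two hidden-state vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def find_vip(with_image: np.ndarray, without_image: np.ndarray,
             threshold: float = 0.1) -> int:
    """Locate a Visual Integration Point: the first layer at which the
    with-image and without-image embedding chains diverge past `threshold`.
    (Assumed detection rule; the abstract does not specify the criterion.)
    Both inputs have shape (num_layers, hidden_dim)."""
    for layer, (w, wo) in enumerate(zip(with_image, without_image)):
        if cosine_distance(w, wo) > threshold:
            return layer
    return len(with_image)  # no integration detected

def total_visual_integration(with_image: np.ndarray,
                             without_image: np.ndarray,
                             vip: int | None = None) -> float:
    """TVI sketch: aggregate the representation distance from the VIP onward."""
    if vip is None:
        vip = find_vip(with_image, without_image)
    return sum(cosine_distance(w, wo)
               for w, wo in zip(with_image[vip:], without_image[vip:]))
```

In practice, one would extract the hidden state of, e.g., the final prompt token at every layer (in Hugging Face `transformers`, via `output_hidden_states=True`), run the model once with and once without the visual input, and pass the two `(num_layers, hidden_dim)` arrays to the functions above. A higher TVI would then indicate stronger visual influence on generation, consistent with the reported negative correlation between TVI and LP bias.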