🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive hallucination in real-world applications; while conventional wisdom attributes this to overly strong language priors, this work identifies *insufficient language prior strength* as the primary performance bottleneck. Method: The authors introduce LanP, a fine-grained benchmark explicitly designed to evaluate language prior capability—comprising 170 images subjected to controlled visual degradation (e.g., occlusion) and 340 carefully designed questions—to systematically assess models' reliance on and utilization of language priors when visual information is only partially available. Results: Experiments across 25 state-of-the-art LVLMs—including GPT-4 Turbo—reveal that many models achieve accuracy below 0.5 when the queried object is partially hidden, indicating that their language priors are not strong enough to support robust reasoning in concert with degraded visual signals.
📝 Abstract
Large Vision-Language Models (LVLMs) have shown impressive performance on various tasks. However, LVLMs suffer from hallucination, which hinders their adoption in the real world. Existing studies have emphasized that the strong language priors of LVLMs can overpower visual information, causing hallucinations. However, the positive role of language priors is key to a powerful LVLM. If language priors are too weak, LVLMs will struggle to leverage their rich parametric knowledge and instruction-understanding abilities to complete tasks in challenging visual scenarios where visual information alone is insufficient. Therefore, we propose a benchmark called LanP to rethink the impact of Language Priors in LVLMs. It is designed to investigate how strong the language priors of current LVLMs are. LanP consists of 170 images and 340 corresponding well-designed questions. Extensive experiments on 25 popular LVLMs reveal that many LVLMs' language priors are not strong enough to aid question answering effectively when objects are partially hidden. Many models, including GPT-4 Turbo, exhibit an accuracy below 0.5 in this scenario.
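The evaluation protocol described above—querying a model about objects that are partially hidden and measuring accuracy—can be sketched as a simple loop. This is a hypothetical illustration, not the authors' released code: the `occlude` masking step, the stub model, and the exact-match scoring are all assumptions for the sake of a runnable example.

```python
# Hypothetical sketch of a LanP-style evaluation loop (illustrative only;
# not the benchmark's actual implementation).

def occlude(image, box):
    """Zero out a rectangular region of a 2D pixel grid to hide an object."""
    x0, y0, x1, y1 = box
    for y in range(y0, y1):
        for x in range(x0, x1):
            image[y][x] = 0
    return image

def evaluate(model, samples):
    """samples: list of (image, question, expected_answer, occlusion_box).
    Returns the model's exact-match accuracy on the occluded images."""
    correct = 0
    for image, question, expected, box in samples:
        answer = model(occlude(image, box), question)
        correct += int(answer.strip().lower() == expected.strip().lower())
    return correct / len(samples)

# Toy usage with a stub "model" that always answers "dog".
stub_model = lambda image, question: "dog"
img = [[255] * 4 for _ in range(4)]
acc = evaluate(stub_model, [(img, "What animal is behind the fence?", "dog", (0, 0, 2, 2))])
print(acc)  # 1.0
```

A model with strong, well-calibrated language priors would score high here even though the occluded region removes direct visual evidence; the paper's finding is that many LVLMs fall below 0.5 under this kind of partial-information condition.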