🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive hallucination in real-world applications; while conventional wisdom attributes this to overly strong language priors, this work identifies *insufficient language prior strength* as the primary performance bottleneck. Method: The authors introduce LanP, a fine-grained benchmark explicitly designed to evaluate language prior capability—comprising 170 images subjected to controlled visual degradation (e.g., occlusion) and 340 carefully designed questions—to systematically assess models' reliance on and utilization of language priors when visual information is only partially available. Results: Experiments across 25 state-of-the-art LVLMs—including GPT-4 Turbo—reveal that many models achieve accuracy below 0.5 when the queried object is partially hidden, indicating that their language priors are not strong enough to support robust reasoning in concert with degraded visual signals.
📝 Abstract
Large Vision-Language Models (LVLMs) have shown impressive performance on various tasks. However, LVLMs suffer from hallucination, which hinders their adoption in the real world. Existing studies have emphasized that the strong language priors of LVLMs can overpower visual information, causing hallucinations. However, the positive role of language priors is key to a powerful LVLM. If language priors are too weak, LVLMs will struggle to leverage their rich parametric knowledge and instruction-understanding abilities to complete tasks in challenging visual scenarios where visual information alone is insufficient. Therefore, we propose a benchmark called LanP to rethink the impact of Language Priors in LVLMs. It is designed to investigate how strong the language priors of current LVLMs are. LanP consists of 170 images and 340 corresponding well-designed questions. Extensive experiments on 25 popular LVLMs reveal that many LVLMs' language priors are not strong enough to aid question answering effectively when objects are partially hidden. Many models, including GPT-4 Turbo, exhibit an accuracy below 0.5 in this scenario.
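The evaluation protocol described above—querying a model about objects that are partially hidden and measuring accuracy—can be sketched as a simple loop. This is a hypothetical illustration, not the authors' released code: the `occlude` masking step, the stub model, and the exact-match scoring are all assumptions for the sake of a runnable example.

```python
# Hypothetical sketch of a LanP-style evaluation loop (illustrative only;
# not the benchmark's actual implementation).

def occlude(image, box):
    """Zero out a rectangular region of a 2D pixel grid to hide an object."""
    x0, y0, x1, y1 = box
    for y in range(y0, y1):
        for x in range(x0, x1):
            image[y][x] = 0
    return image

def evaluate(model, samples):
    """samples: list of (image, question, expected_answer, occlusion_box).
    Returns the model's exact-match accuracy on the occluded images."""
    correct = 0
    for image, question, expected, box in samples:
        answer = model(occlude(image, box), question)
        correct += int(answer.strip().lower() == expected.strip().lower())
    return correct / len(samples)

# Toy usage with a stub "model" that always answers "dog".
stub_model = lambda image, question: "dog"
img = [[255] * 4 for _ in range(4)]
acc = evaluate(stub_model, [(img, "What animal is behind the fence?", "dog", (0, 0, 2, 2))])
print(acc)  # 1.0
```

A model with strong, well-calibrated language priors would score high here even though the occluded region removes direct visual evidence; the paper's finding is that many LVLMs fall below 0.5 under this kind of partial-information condition.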