🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive object hallucination, primarily caused by spurious correlations between co-occurring objects in training data.
Method: This work pioneers the application of causal inference to this problem, formalizing hallucination mechanisms via a structured causal model. We propose a systematic counterfactual sample generation method and introduce Causal-HalBench—the first causal benchmark enabling quantitative evaluation of spurious correlation effects. Our evaluation framework integrates structural causal modeling, counterfactual reasoning, text-to-image generation, and domain-specific LVLMs.
Contribution/Results: Extensive experiments across state-of-the-art LVLMs demonstrate that all models exhibit significant susceptibility to spurious correlations. Crucially, causal interventions—grounded in counterfactual analysis—prove both effective and broadly applicable for hallucination detection and mitigation, establishing a principled foundation for robust LVLM development.
📝 Abstract
Large Vision-Language Models (LVLMs) often suffer from object hallucination, making erroneous judgments about the presence of objects in images. We propose that this primarily stems from spurious correlations arising when models strongly associate highly co-occurring objects during training, leading to hallucinated objects influenced by visual context. Current benchmarks mainly focus on hallucination detection but lack a formal characterization and quantitative evaluation of spurious correlations in LVLMs. To address this, we introduce causal analysis into the object recognition scenario of LVLMs, establishing a Structural Causal Model (SCM). Using the language of causality, we formally define spurious correlations arising from co-occurrence bias. To quantify the influence of these spurious correlations, we develop Causal-HalBench, a benchmark constructed from counterfactual samples and equipped with comprehensive causal metrics designed to assess model robustness against spurious correlations. Concurrently, we propose an extensible pipeline for constructing these counterfactual samples, leveraging the capabilities of proprietary LVLMs and Text-to-Image (T2I) models for their generation. Our evaluations of mainstream LVLMs on Causal-HalBench demonstrate that these models exhibit susceptibility to spurious correlations, albeit to varying extents.
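The core evaluation idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual metric suite: in each counterfactual sample a target object has been removed from the image while its highly co-occurring context objects remain, and a model driven by spurious co-occurrence correlations will still report the removed object. The function and data below are hypothetical stand-ins for a model's object predictions.

```python
def hallucination_rate(predictions, removed_objects):
    """Fraction of counterfactual samples in which the model still reports
    the object that was removed from the image (a spurious-correlation
    hallucination). `predictions` is a list of sets of predicted object
    names; `removed_objects` lists the object removed from each sample."""
    assert len(predictions) == len(removed_objects)
    hallucinated = sum(
        1 for pred, removed in zip(predictions, removed_objects)
        if removed in pred
    )
    return hallucinated / len(predictions)

# Toy counterfactual samples (hypothetical model outputs): "surfboard" was
# removed from a beach scene, "keyboard" from a desk scene, "fork" from a
# dining scene.
preds = [
    {"person", "wave", "surfboard"},  # hallucinated: surfboard was removed
    {"monitor", "desk"},              # correct: keyboard not reported
    {"plate", "table", "fork"},       # hallucinated: fork was removed
]
removed = ["surfboard", "keyboard", "fork"]
rate = hallucination_rate(preds, removed)  # 2 of 3 samples hallucinate
```

A robust model would score near zero on such counterfactuals; the gap between a model's accuracy on original images and on counterfactual ones is what a causal benchmark of this kind quantifies.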