Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the prevalent issue of object hallucination in multimodal large language models, which often arises from overreliance on linguistic priors and insufficient introspective verification against visual evidence. The authors propose a training-free inference framework that introduces, for the first time, an attribution-based introspection mechanism inspired by metacognition. This mechanism detects hallucination risks by identifying conflicts between language and visual prediction probabilities and locates causal visual anchors accordingly. By integrating instance-level interpretable bidirectional causal guidance with adaptive confidence calibration, the framework dynamically adjusts the reasoning process to enable self-correction. Without modifying the underlying model, the method reduces hallucination rates by 12.67% on MMHal-Bench and improves accuracy by 5.8% on POPE, achieving state-of-the-art performance.
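The summary above describes a probability-conflict check between the model's visually grounded and language-only predictions. Below is a minimal sketch of how such a check could look at decoding time; it assumes access to next-token logits computed with and without the image, and the function name, KL-based conflict signal, and thresholds are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of the introspection step: flag a decoding step as risky
# when the language-only prediction is confident yet disagrees with the
# visually grounded prediction. Names and thresholds are illustrative.
import torch
import torch.nn.functional as F

def hallucination_risk(logits_with_image: torch.Tensor,
                       logits_text_only: torch.Tensor,
                       threshold: float = 0.3) -> tuple[bool, torch.Tensor]:
    """Return (risky, conflict) for one decoding step."""
    p_vis = F.softmax(logits_with_image, dim=-1)   # p(token | image, prompt)
    p_lang = F.softmax(logits_text_only, dim=-1)   # p(token | prompt only)
    # Conflict signal: how far the language-prior distribution diverges from
    # the visually grounded one, i.e. KL(p_lang || p_vis).
    conflict = F.kl_div(p_vis.log(), p_lang, reduction="sum")
    # "Blind confidence": the language prior alone is already very peaked.
    blind_confidence = p_lang.max()
    risky = bool(conflict > threshold and blind_confidence > 0.5)
    return risky, conflict
```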

📝 Abstract
Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.
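The steering and calibration described in the abstract can be approximated at the logit level as a rough sketch: down-weight the language prior in proportion to the detected conflict and renormalize. This is a contrastive-decoding-style stand-in for illustration only; the paper's actual method steers the inference process via interpretable bi-causal guidance, and the calibration weights (alpha, beta) below are assumptions.

```python
# Minimal, assumption-laden sketch of adaptive calibration applied to the
# next-token logits, reusing the conflict signal from the introspection step.
import torch
import torch.nn.functional as F

def steer_and_calibrate(logits_with_image: torch.Tensor,
                        logits_text_only: torch.Tensor,
                        conflict: torch.Tensor,
                        beta: float = 1.0) -> torch.Tensor:
    """Return calibrated log-probabilities for one decoding step."""
    # Adaptive calibration weight: apply a stronger correction when the
    # language prior conflicts more with the visual evidence.
    alpha = torch.sigmoid(beta * conflict)
    # Contrast the visually grounded logits against the language-only logits,
    # scaled by the instance-specific weight (a stand-in for the paper's
    # instance-level steering, not its actual formulation).
    steered = logits_with_image + alpha * (logits_with_image - logits_text_only)
    return F.log_softmax(steered, dim=-1)
```

A decoder would then sample or take the argmax from the returned log-probabilities only at steps the introspection check flagged as risky, leaving unflagged steps untouched.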
Problem

Research questions and friction points this paper is trying to address.

object hallucination
multimodal large language models
cognitive introspection
visual-linguistic alignment
overconfident generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Introspection
Object Hallucination
Bi-Causal Steering
Attributive Introspection
Multimodal Large Language Models