Treble Counterfactual VLMs: A Causal Approach to Hallucination

📅 2025-03-08

📈 Citations: 0

✨ Influential: 0

career value

198K/year

🤖 AI Summary

To address visual-textual inconsistency hallucinations in vision-language models (VLMs) arising from modality-specific shortcuts—particularly critical in safety-sensitive applications—this paper proposes the first systematic solution grounded in structural causal modeling (SCM) and counterfactual analysis. We construct a multimodal causal graph, quantify the natural direct effect (NDE) to measure shortcut influence, and design a triple-counterfactual framework to identify and block unimodal (visual or textual) shortcuts, enabling interpretable and intervenable hallucination suppression. Furthermore, we introduce a test-time dynamic modality-dependency regulation module that enforces strict reliance on genuine cross-modal fusion. Evaluated across multiple benchmarks, our method reduces hallucination rates by 38.2% on average while preserving VQA accuracy with negligible fluctuation (<0.5%). The implementation is open-sourced and supports plug-and-play deployment.

Technology Category

Application Category

📝 Abstract

Vision-Language Models (VLMs) have advanced multi-modal tasks like image captioning, visual question answering, and reasoning. However, they often generate hallucinated outputs inconsistent with the visual context or prompt, limiting reliability in critical applications like autonomous driving and medical imaging. Existing studies link hallucination to statistical biases, language priors, and biased feature learning but lack a structured causal understanding. In this work, we introduce a causal perspective to analyze and mitigate hallucination in VLMs. We hypothesize that hallucination arises from unintended direct influences of either the vision or text modality, bypassing proper multi-modal fusion. To address this, we construct a causal graph for VLMs and employ counterfactual analysis to estimate the Natural Direct Effect (NDE) of vision, text, and their cross-modal interaction on the output. We systematically identify and mitigate these unintended direct effects to ensure that responses are primarily driven by genuine multi-modal fusion. Our approach consists of three steps: (1) designing structural causal graphs to distinguish correct fusion pathways from spurious modality shortcuts, (2) estimating modality-specific and cross-modal NDE using perturbed image representations, hallucinated text embeddings, and degraded visual inputs, and (3) implementing a test-time intervention module to dynamically adjust the model's dependence on each modality. Experimental results demonstrate that our method significantly reduces hallucination while preserving task performance, providing a robust and interpretable framework for improving VLM reliability. To enhance accessibility and reproducibility, our code is publicly available at https://github.com/TREE985/Treble-Counterfactual-VLMs.

Problem

Research questions and friction points this paper is trying to address.

Addresses hallucination in Vision-Language Models (VLMs).

Analyzes unintended direct influences of vision or text modalities.

Mitigates spurious modality shortcuts to improve VLM reliability.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal graph analysis for VLM hallucination mitigation

Counterfactual estimation of modality-specific direct effects

Dynamic test-time intervention for multi-modal fusion

🔎 Similar Papers

No similar papers found.