🤖 AI Summary
Multimodal large language models (MLLMs) frequently suffer from object hallucination—generating objects absent from the input—due to spurious co-occurrence biases in training data. This work pioneers the application of causal inference to mitigate multimodal hallucination, proposing a causal disentanglement framework. It introduces a causally driven projector for the visual pathway to achieve cross-modal representation disentanglement and integrates a do-calculus-based counterfactual intervention module with structured attention masking at the language model’s final layer to eliminate spurious correlations. By targeting the root cause—data-induced erroneous activation—the method suppresses hallucination at its origin. Evaluated across multiple benchmarks, it reduces average hallucination rates by 38.7% while preserving or improving performance on downstream tasks such as visual question answering and image captioning. Visualization analyses confirm significantly enhanced separability of object representations.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated strong performance in visual understanding tasks, yet they often suffer from object hallucinations--generating descriptions of objects that are inconsistent with or entirely absent from the input. This issue is closely related to dataset biases, where frequent co-occurrences of objects lead to entangled semantic representations across modalities. As a result, models may erroneously activate object representations that are commonly associated with the input but not actually present. To address this, we propose a causality-driven disentanglement framework that mitigates hallucinations through causal intervention. Our approach includes a Causal-Driven Projector in the visual pathway and a Causal Intervention Module integrated into the final transformer layer of the language model. These components work together to reduce spurious correlations caused by biased training data. Experimental results show that our method significantly reduces hallucinations while maintaining strong performance on multiple multimodal benchmarks. Visualization analyses further confirm improved separability of object representations. The code is available at: https://github.com/IgniSavium/Causal-LLaVA