🤖 AI Summary
This work addresses the prevalent issue of object hallucination in multimodal large language models, where spurious cross-modal dependencies cause the model to generate objects not present in the input image, compromising output reliability. To mitigate this, the authors propose a causal decoding framework that, for the first time, integrates causal intervention directly into the multimodal decoding process. By dynamically adjusting token probabilities during generation, the method intervenes in the mechanism that triggers hallucination rather than relying on heuristic penalties or post-hoc corrections. Evaluated across multiple image captioning and visual question answering benchmarks, the approach significantly reduces object hallucination rates and achieves state-of-the-art generation faithfulness without sacrificing linguistic fluency or overall output quality.
📝 Abstract
Multimodal Large Language Models (MLLMs) deliver detailed responses on vision-language tasks, yet remain susceptible to object hallucination (introducing objects not present in the image), undermining reliability in practice. Prior efforts often rely on heuristic penalties, post-hoc correction, or generic decoding tweaks, which do not directly intervene in the mechanisms that trigger object hallucination and thus yield limited gains. To address this challenge, we propose a causal decoding framework that applies targeted causal interventions during generation to curb spurious object mentions. By reshaping the decoding dynamics to attenuate spurious dependencies, our approach reduces false object tokens while maintaining descriptive quality. Across captioning and QA benchmarks, our framework substantially lowers object-hallucination rates and achieves state-of-the-art faithfulness without degrading overall output quality.
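The abstract does not spell out the intervention mechanics, but one plausible instantiation of "adjusting token probabilities during generation via causal intervention" is to contrast the model's next-token logits under the full input against logits computed from a causally intervened input (e.g., with the visual evidence ablated), down-weighting tokens that remain likely without the image and therefore stem from language priors rather than visual grounding. The sketch below is illustrative only; the function name, the `alpha` strength parameter, and the contrastive form are assumptions, not the paper's actual method.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D logit vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def intervened_decoding_step(logits_full: np.ndarray,
                             logits_intervened: np.ndarray,
                             alpha: float = 1.0) -> np.ndarray:
    """Contrastive-style adjustment (hypothetical): amplify tokens supported
    by the full input and suppress tokens whose probability survives the
    intervention, i.e., tokens driven by spurious (non-visual) dependencies."""
    adjusted = (1 + alpha) * logits_full - alpha * logits_intervened
    return softmax(adjusted)

# Toy vocabulary: ["dog", "cat", "frisbee"], with "frisbee" absent
# from the image but favored by the language prior.
logits_full = np.array([3.0, 1.0, 2.5])     # logits given image + text
logits_interv = np.array([0.5, 0.5, 2.4])   # logits with image ablated
probs = intervened_decoding_step(logits_full, logits_interv)
```

In this toy example, "frisbee" stays likely even when the image is ablated, so the adjustment sharply reduces its probability relative to plain decoding, while grounded tokens like "dog" are reinforced.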