🤖 AI Summary
This work addresses hallucination in multimodal large language models (MLLMs), where generated text contradicts the visual content because the visual and linguistic modalities are misaligned. The authors propose Dual-Anchor Introspective Decoding (DaID), a novel contrastive decoding framework that uses visual attention distributions to calibrate text generation token by token. For each token, a Spotlight layer amplifies signals aligned with visual evidence, while a Shadow layer suppresses the inertial bias that comes from textual priors. Across multiple benchmarks and MLLMs, this token-level intervention is reported to significantly reduce hallucination while also improving general reasoning capabilities.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities, yet they continue to suffer from hallucination, where generated text contradicts visual content. In this paper, we introduce Dual-Anchor Introspective Decoding (DaID), a novel contrastive decoding framework that dynamically calibrates the generation of each token by mining the model's internal perceptual discrepancies. Specifically, DaID identifies a Spotlight layer to amplify visual factual signals and a Shadow layer to suppress textual inertia. By leveraging visual attention distributions to guide this dual-anchor selection process, our method ensures precise, token-specific adaptation. Experimental results across multiple benchmarks and MLLMs demonstrate that DaID significantly mitigates hallucination while enhancing general reasoning capabilities.
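The abstract does not give DaID's selection rule or scoring formula, so the following is only a rough sketch of how a dual-anchor contrastive step could look for a single token. Everything specific here is an assumption made for illustration: the Spotlight layer is taken to be the layer with the most attention mass on image tokens and the Shadow layer the one with the least, the final score is the last-layer log-probability plus `alpha` times the Spotlight-minus-Shadow log-probability gap, and the names `select_anchors`, `dual_anchor_step`, and `alpha` are hypothetical rather than taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def select_anchors(visual_attention_mass):
    """Return (spotlight, shadow) layer indices.

    Assumption: the layer attending most to image tokens is the Spotlight,
    and the layer attending least is the Shadow.
    """
    layers = range(len(visual_attention_mass))
    spotlight = max(layers, key=lambda i: visual_attention_mass[i])
    shadow = min(layers, key=lambda i: visual_attention_mass[i])
    return spotlight, shadow


def dual_anchor_step(final_logits, layer_logits, visual_attention_mass, alpha=1.0):
    """Pick the next token with a hypothetical dual-anchor contrastive score.

    final_logits: last-layer logits over the vocabulary.
    layer_logits: per-layer logits (e.g. from early-exit projections).
    visual_attention_mass: per-layer attention mass on image tokens
        for the current decoding step.
    """
    spotlight, shadow = select_anchors(visual_attention_mass)
    base = log_softmax(final_logits)
    spot = log_softmax(layer_logits[spotlight])
    shad = log_softmax(layer_logits[shadow])
    # Boost tokens the Spotlight layer favours, penalise those the Shadow favours.
    scores = [b + alpha * (s - d) for b, s, d in zip(base, spot, shad)]
    token = max(range(len(scores)), key=lambda i: scores[i])
    return token, scores


# Toy example with a 3-token vocabulary and 3 intermediate layers.
final_logits = [2.0, 2.1, 0.0]          # plain greedy decoding would pick token 1
layer_logits = [
    [0.0, 2.0, 0.0],                    # layer 0: text-driven, prefers token 1
    [1.0, 1.0, 1.0],                    # layer 1: uninformative
    [2.0, 0.0, 0.0],                    # layer 2: vision-driven, prefers token 0
]
visual_attention_mass = [0.1, 0.3, 0.6]

token, _ = dual_anchor_step(final_logits, layer_logits, visual_attention_mass)
print(token)
```

In this toy input the last layer slightly prefers token 1, but the contrast between the vision-heavy layer and the text-heavy layer shifts the choice to token 0. Because the anchors are recomputed from the attention masses at every step, the adjustment is token-specific, which is the property the abstract emphasises.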