Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

📅 2025-05-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from initial and snowball hallucinations in visual question answering (VQA), which leave them vulnerable to anomalous tokens and prone to neglecting dense contextual information. To address this, we propose a causal attention decoding intervention: a learnable attention register placed in the upper triangle of the causal mask, combined with a decaying position-aware encoding scheme, dynamically recalibrates attention weights during decoding, suppressing noisy tokens while enhancing long-range contextual modeling. The method requires no architectural modifications, is plug-and-play, and remains fully compatible with mainstream MLLM frameworks. Evaluated on both image- and video-based VQA benchmarks, it significantly reduces hallucination rates while improving answer accuracy and robustness to input perturbations. This work establishes a novel paradigm for trustworthy decoding in multimodal generative tasks.

📝 Abstract
Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference, we propose to leverage causal masks in the decoding strategy to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between these tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy that reduces attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. Extensive experiments show that FarSight significantly mitigates hallucinations across different MLLMs on both image and video benchmarks.
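The attention-register idea from the abstract can be sketched as follows. A minimal numpy illustration, not the paper's implementation: the upper triangle of the causal mask, which a standard decoder fills with -inf, instead holds finite values that can absorb softmax mass otherwise diverted to outlier tokens. The `register_logit` parameter is a hypothetical stand-in for the learnable register values described in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def farsight_causal_mask(seq_len, register_logit=-2.0):
    """Causal mask whose upper triangle acts as an attention register.

    A standard causal mask fills the upper triangle with -inf so each
    token attends only to its past. Following the abstract's idea, the
    upper triangle instead holds finite (in the paper, learnable)
    values that can soak up attention otherwise drawn to outlier
    tokens. `register_logit` is an illustrative stand-in for those
    learned parameters.
    """
    mask = np.zeros((seq_len, seq_len))
    rows, cols = np.triu_indices(seq_len, k=1)
    mask[rows, cols] = register_logit  # register slots instead of -inf
    return mask

# Toy example: attention weights for a 4-token sequence.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                   # raw attention logits
weights = softmax(scores + farsight_causal_mask(4))
```

Each row of `weights` still sums to one, but a small, tunable fraction of every token's attention now lands in the register slots rather than being forced onto the visible (possibly outlier) past tokens.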
Problem

Research questions and friction points this paper is trying to address.

Mitigating hallucinations in multimodal large language models
Enhancing token interaction to reduce outlier reliance
Improving in-context inference with dynamic attention allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages causal masks for token interaction
Intervenes in token propagation to suppress outlier tokens
Uses attention register for dynamic allocation
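The abstract's other component, a positional-awareness encoding with a diminishing masking rate, could look roughly like the sketch below. The geometric decay, the `base_rate` and `decay` parameters, and the function name are all illustrative assumptions, not the paper's formulation; the only property taken from the source is that the masking rate diminishes so the model can attend to further preceding tokens, which matters for long video sequences.

```python
import numpy as np

def position_aware_mask_rates(seq_len, base_rate=0.5, decay=0.9):
    """Masking rate that diminishes with distance to a past token.

    Hypothetical sketch: the probability of masking a preceding token
    decays geometrically with its distance, so distant context is
    masked less often and long-range attention is encouraged.
    """
    distances = np.arange(seq_len)
    return base_rate * decay ** distances

# Sample which of 8 preceding tokens stay visible at this decoding step.
rng = np.random.default_rng(0)
rates = position_aware_mask_rates(8)
visible = rng.random(8) >= rates  # True where the past token is kept
```

Under this scheme the nearest token is masked with probability `base_rate`, while tokens further back approach a masking rate of zero, biasing the model toward retaining far-away context.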