AI Summary
This work addresses the susceptibility of multimodal diffusion large language models (MDLLMs) to multimodal hallucinations during parallel masked decoding, which arises from relying solely on textual likelihood to rank candidate tokens without verifying local visual evidence. To mitigate this, the authors propose VISAGE, a training-free inference framework that calibrates the decoding objective and re-ranks tokens by leveraging cross-attention spatial entropy quantification and inter-head localization consensus, prioritizing outputs grounded in visual input. Notably, VISAGE conceptualizes hallucination as a local optimization error and achieves visual grounding through calibration of attention spatial distributions, with theoretical stability guarantees. Experiments demonstrate that VISAGE yields relative performance improvements of 8.59% on MMMU-val and 7.75% on HallusionBench, significantly enhancing the model's robustness against hallucinations.
Abstract
Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet these architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
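The re-ranking idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the exact entropy estimator, the inter-head consensus rule, and the penalty weight `lam` are assumptions chosen for illustration (here, Shannon entropy over image patches and a simple mean across heads). It shows how a spatially uniform cross-attention pattern, signaling weak visual grounding, can penalize a candidate token's textual likelihood during decoding.

```python
import numpy as np

def spatial_entropy(attn_map):
    """Shannon entropy of one head's cross-attention distribution over
    image patches. High entropy => near-uniform attention => the token
    is not anchored to any local visual region."""
    p = attn_map / attn_map.sum()
    p = np.clip(p, 1e-12, None)  # guard log(0)
    return float(-(p * np.log(p)).sum())

def visage_score(log_likelihood, head_attn_maps, lam=1.0):
    """Hypothetical calibrated decoding score: textual log-likelihood
    minus a consensus spatial-entropy penalty.

    head_attn_maps: array of shape (H, P) -- one attention distribution
    over P image patches for each of H attention heads.
    """
    entropies = np.array([spatial_entropy(a) for a in head_attn_maps])
    # Inter-head consensus taken as the mean entropy across heads
    # (one simple aggregation choice; the paper's rule may differ).
    consensus_entropy = entropies.mean()
    return log_likelihood - lam * consensus_entropy
```

Under this scoring, a candidate whose heads agree on a peaked, localized attention map can outrank a slightly more likely candidate whose attention is spread uniformly over the image, which is the intended re-ranking behavior.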