🤖 AI Summary
Large Vision-Language Models (LVLMs) frequently generate hallucinated text inconsistent with input images during multimodal understanding. While existing inference-time interventions mitigate hallucinations, they incur substantial computational latency. This paper proposes SPIN—a training-free, zero-overhead attention head suppression strategy. We first identify that hallucinations originate from dynamically varying subsets of attention heads across layers. Leveraging this insight, we design an image-guided selective suppression mechanism: for each textual token, we retain only the Top-K vision-relevant attention heads—selected based on their attention scores over image tokens—and suppress those with low visual relevance. SPIN is task-agnostic, requires no parameter updates, and introduces no throughput degradation. Evaluated on VQA and image captioning, SPIN reduces hallucination scores by up to 2.7× while preserving F1 performance and improving inference throughput by 1.8×.
📝 Abstract
Despite their remarkable progress in multimodal understanding tasks, large vision-language models (LVLMs) often suffer from "hallucinations", generating texts misaligned with the visual context. Existing methods that reduce hallucinations through inference-time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-K attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores by up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at https://github.com/YUECHE77/SPIN.
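The head-selection step described above can be sketched as follows. This is a minimal NumPy illustration of the stated mechanism, not the authors' implementation: for one text query token, each head's visual relevance is taken to be the total attention mass it places on image tokens, and only the top-K heads by that score are kept. The function name `spin_head_mask` and the exact scoring rule are assumptions for illustration.

```python
import numpy as np

def spin_head_mask(attn, image_token_ids, k):
    """Select which attention heads to keep for one text query token.

    attn: array of shape (num_heads, seq_len), each head's attention
          weights over all key tokens for this query token.
    image_token_ids: indices of the image tokens in the sequence.
    k: number of vision-relevant heads to retain.
    Returns a boolean mask of shape (num_heads,): True = keep head.
    """
    # Visual relevance per head: total attention placed on image tokens
    # (one plausible scoring rule; the paper's exact criterion may differ).
    vis_score = attn[:, image_token_ids].sum(axis=1)
    # Keep the top-K heads by visual relevance; suppress the rest.
    keep = np.argsort(vis_score)[-k:]
    mask = np.zeros(attn.shape[0], dtype=bool)
    mask[keep] = True
    return mask

# Toy example: 3 heads, 4 tokens, image tokens at positions 0 and 1.
attn = np.array([
    [0.5, 0.4, 0.05, 0.05],  # head 0: mostly attends to image tokens
    [0.1, 0.1, 0.40, 0.40],  # head 1: mostly attends to text tokens
    [0.3, 0.3, 0.20, 0.20],  # head 2: mixed
])
mask = spin_head_mask(attn, image_token_ids=[0, 1], k=2)
```

With K=2, heads 0 and 2 are kept and head 1 (lowest attention to image tokens) is suppressed; the suppressed heads' outputs would then be zeroed or down-weighted in the layer's output.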