VASparse: Towards Efficient Visual Hallucination Mitigation for Large Vision-Language Model via Visual-Aware Sparsification

📅 2025-01-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive visual hallucinations (VH), generating text that is inconsistent with the input image, while existing mitigation methods often compromise inference efficiency. This paper proposes VASparse, a plug-and-play decoding algorithm that suppresses unfaithful outputs during inference, without additional training or post-processing. VASparse integrates visual-aware token sparsification with contrastive decoding under sparsity constraints. It introduces an attention-score recalibration mechanism to alleviate attention sinking toward text tokens, and a visual-aware pruning strategy that avoids the redundant secondary decoding and rollback of prior methods. Evaluated on four major benchmarks, VASparse significantly reduces VH rates across diverse LVLMs, including Qwen-VL, LLaVA, and MiniGPT-4, while maintaining near-original decoding speed and achieving state-of-the-art performance.
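The contrastive recalibration described above can be sketched in the standard visual-contrastive-decoding form, where logits from a full-context pass are contrasted against logits from a visually sparsified pass. This is a minimal illustration, not the paper's exact procedure; the function name and the `alpha` weight are assumptions:

```python
import numpy as np

def sparse_contrastive_logits(logits_full, logits_sparse, alpha=1.0):
    """Recalibrate next-token logits by contrasting a full-context pass
    against a visually sparsified pass: (1 + alpha) * full - alpha * sparse.
    Tokens that become likelier once visual evidence is removed
    (hallucination-prone tokens) are thereby penalized."""
    logits_full = np.asarray(logits_full, dtype=float)
    logits_sparse = np.asarray(logits_sparse, dtype=float)
    return (1.0 + alpha) * logits_full - alpha * logits_sparse

# Token 0 is supported by the image; token 1 stays likely even without it.
recalibrated = sparse_contrastive_logits([2.0, 1.0, 0.0], [0.0, 2.0, 0.0])
```

A single forward pass over the sparsified sequence stands in for the secondary decoding that slows earlier contrastive methods; `alpha` controls the penalty strength.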

📝 Abstract
Large Vision-Language Models (LVLMs) may produce outputs that are unfaithful to reality, also known as visual hallucinations (VH), which significantly impedes their real-world usage. To alleviate VH, various decoding strategies have been proposed to enhance visual information. However, many of these methods may require secondary decoding and rollback, which significantly reduces inference speed. In this work, we propose an efficient plug-and-play decoding algorithm via Visual-Aware Sparsification (VASparse) from the perspective of token sparsity for mitigating VH. VASparse is inspired by two empirical observations: (1) attention activation in LVLMs is sparse, and (2) visual-agnostic token sparsification exacerbates VH. Based on these insights, we propose a novel token sparsification strategy that balances efficiency and trustworthiness. Specifically, VASparse implements a visual-aware token selection strategy during decoding to reduce redundant tokens while effectively preserving visual context. Additionally, we introduce a sparsity-based visual contrastive decoding method to recalibrate the distribution of hallucinated outputs without the time overhead associated with secondary decoding. VASparse further recalibrates attention scores to penalize attention sinking of LVLMs toward text tokens. Extensive experiments across four popular benchmarks confirm the effectiveness of VASparse in mitigating VH across different LVLM families without requiring additional training or post-processing. Impressively, VASparse achieves state-of-the-art performance for mitigating VH while maintaining competitive decoding speed. Code is available at https://github.com/mengchuang123/VASparse-github.
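The visual-aware token selection idea (keep tokens that attend strongly to image tokens, prune visual-agnostic ones) can be sketched roughly as follows. The function name, scoring rule, and `keep_ratio` budget are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def visual_aware_select(attn, visual_idx, keep_ratio=0.5):
    """Keep the tokens that direct the most attention toward visual
    tokens, pruning the rest. `attn` is one (num_tokens, num_tokens)
    attention map; `visual_idx` lists the columns holding image tokens."""
    attn = np.asarray(attn, dtype=float)
    visual_mass = attn[:, visual_idx].sum(axis=1)   # attention each token pays to the image
    k = max(1, int(len(visual_mass) * keep_ratio))  # budget of tokens to retain
    keep = np.argsort(-visual_mass)[:k]             # most visually grounded tokens
    return np.sort(keep)                            # restore original token order

# Example: 3 tokens, the last column is the image token.
attn = np.array([[0.1, 0.1, 0.8],
                 [0.5, 0.4, 0.1],
                 [0.3, 0.3, 0.4]])
kept = visual_aware_select(attn, visual_idx=[2], keep_ratio=0.67)
```

Pruning by visual attention mass, rather than by overall attention magnitude, is what distinguishes this from generic (visual-agnostic) token sparsification, which the paper observes worsens hallucination.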
Problem

Research questions and friction points this paper is trying to address.

Visual Hallucinations
Unfaithful Text Generation
Inference Efficiency Trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

VASparse
Visual-Aware Token Sparsification
Sparse Visual Contrastive Decoding
🔎 Similar Papers
2024-10-06 · Conference on Empirical Methods in Natural Language Processing · Citations: 33