🤖 AI Summary
This work addresses the hallucination problem in large vision-language models (LVLMs), which often arises from overreliance on linguistic priors, leading to outputs inconsistent with the visual input. To mitigate this, the authors propose a single-pass, attention-space contrastive guidance mechanism that concurrently constructs dual pathways—vision-language and language-only—within the self-attention layers. By contrastively suppressing dominant language priors, the method enhances both visual consistency and semantic fidelity in generated text. An orthogonalized correction component is further incorporated to remove the approximation bias of the single-pass formulation, achieving high performance with substantially reduced computational overhead. Experimental results demonstrate state-of-the-art performance on the CHAIR and POPE benchmarks, with up to a 2× reduction in inference latency compared to existing contrastive decoding approaches that require multiple forward passes.
📝 Abstract
Hallucinations in large vision-language models (LVLMs) often arise when language priors dominate over visual evidence, causing object misidentification and visually inconsistent descriptions. We address this issue by framing hallucination mitigation as contrastive guidance, steering generation toward visually grounded and semantically faithful text. This approach regulates the model's internal behavior by reducing over-dependence on language priors and contrasting visually grounded representations with language-only ones. We propose Attention-space Contrastive Guidance (ACG), a single-pass mechanism that operates within self-attention layers to construct both vision-language and language-only attention paths in a single forward computation. This integration embeds computationally efficient guidance directly in the model's representation contextualization. To correct the approximation bias introduced by the single-pass formulation, we further apply an orthogonalized correction that removes components aligned with the language-only path, selectively amplifying visual contributions. Experiments on the CHAIR and POPE benchmarks show that ACG achieves state-of-the-art faithfulness and caption quality while significantly reducing computational cost. Our method establishes a principled and efficient alternative to prior contrastive decoding methods that require multiple forward passes, reducing latency by up to 2×.
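The abstract does not give the exact formulation, but the described mechanism can be illustrated with a minimal sketch: compute attention over the full vision-language context, reuse the same scores with visual tokens masked to get a language-only path (the single-pass property), then add a guidance term orthogonalized against the language-only output. The function name `acg_attention`, the guidance strength `gamma`, and the specific orthogonal projection below are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def acg_attention(q, k, v, vis_mask, gamma=0.5):
    """Sketch of single-pass attention-space contrastive guidance (assumed form).

    q: (d,) query for the current token
    k, v: (n, d) keys/values over the full vision+language context
    vis_mask: (n,) boolean, True where the context token is a visual token
    gamma: guidance strength (hypothetical hyperparameter)
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)            # (n,) shared by both paths

    # Vision-language path: attend over the full context.
    o_vl = softmax(scores) @ v               # (d,)

    # Language-only path: mask out visual tokens but reuse the same scores,
    # so both paths come from one forward computation.
    lang_scores = np.where(vis_mask, -np.inf, scores)
    o_lang = softmax(lang_scores) @ v        # (d,)

    # Contrastive guidance direction, orthogonalized against the
    # language-only output so only non-language-aligned (visual)
    # components are amplified.
    g = o_vl - o_lang
    g_perp = g - (g @ o_lang) / (o_lang @ o_lang + 1e-8) * o_lang

    return o_vl + gamma * g_perp
```

With `gamma = 0` the sketch reduces to ordinary vision-language attention; increasing `gamma` pushes the output away from what the language-only path alone would predict, which is the contrastive-suppression intuition described above.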