🤖 AI Summary
Vision-language models (VLMs) suffer from hallucinations (spurious objects, attributes, or relations) caused by over-reliance on linguistic priors and inconsistent cross-modal representations. To address this, we propose a training-free, zero-overhead spectral representation filtering method: it identifies hallucination-dominant low-rank patterns via eigendecomposition of the feature-difference covariance matrix, then suppresses these components with a soft spectral filter applied directly to the feed-forward projection weights of deeper layers, enabling post-hoc representational calibration. Crucially, the approach modifies neither the model architecture nor its parameters and requires no fine-tuning. Evaluated on LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2 across the MSCOCO and POPE-VQA benchmarks, it significantly reduces hallucination rates while achieving state-of-the-art faithfulness, without compromising generation quality.
📝 Abstract
Vision-language models (VLMs) frequently hallucinate, describing objects, attributes, or relations that do not exist in the image, due to over-reliance on language priors and imprecise cross-modal grounding. We introduce Spectral Representation Filtering (SRF), a lightweight, training-free method that suppresses such hallucinations by analyzing and correcting the covariance structure of the model's representations. SRF identifies low-rank hallucination modes through eigendecomposition of the covariance of the differences between features collected for truthful and hallucinatory captions, revealing structured biases in the feature space. A soft spectral filter then attenuates these modes in the feed-forward projection weights of deeper VLM layers, equalizing feature variance while preserving semantic fidelity. Unlike decoding- or retraining-based approaches, SRF operates entirely post hoc, incurs zero inference overhead, and requires no architectural modifications. Across three families of VLMs (LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2), SRF consistently reduces hallucination rates on MSCOCO, POPE-VQA, and other visual benchmarks, achieving state-of-the-art faithfulness without degrading caption quality.
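The pipeline the abstract describes (difference covariance, eigendecomposition, soft spectral filtering of a projection weight) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the number of attenuated modes `k`, and the attenuation strength `alpha` are all assumptions introduced here for clarity.

```python
import numpy as np

def spectral_filter_weights(W, feats_truthful, feats_halluc, k=4, alpha=0.9):
    """Hypothetical sketch of SRF-style soft spectral filtering.

    W              : (d_out, d_in) feed-forward projection weight acting on d_in features.
    feats_truthful : (n, d_in) features collected for truthful captions.
    feats_halluc   : (n, d_in) paired features for hallucinatory captions.
    k              : number of leading "hallucination modes" to attenuate (assumed).
    alpha          : attenuation strength in [0, 1]; 1 removes a mode entirely (assumed).
    """
    # Covariance of the feature differences between paired captions
    diffs = feats_halluc - feats_truthful
    diffs = diffs - diffs.mean(axis=0, keepdims=True)
    cov = diffs.T @ diffs / max(len(diffs) - 1, 1)   # (d_in, d_in)

    # Eigendecomposition: the top eigenvectors span hallucination-dominant low-rank modes
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    U = eigvecs[:, -k:]                              # (d_in, k) leading modes

    # Soft spectral filter: shrink the input subspace spanned by U, leave the rest intact
    F = np.eye(W.shape[1]) - alpha * (U @ U.T)
    return W @ F                                     # filtered weights, applied once post hoc
```

Because the filter is folded into the weight matrix offline, the edited model runs with unchanged architecture and zero extra inference cost, matching the "training-free, zero-overhead" claim.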