🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from object hallucination: generating objects not present in the input image. This work is the first to localize the root cause of hallucination at the image-token level, revealing that a small set of high-attention image tokens (only ~1.5% of all image tokens) dominates hallucinatory generation. To address this, we propose EAZY: a zero-shot, training-free, and architecture-agnostic intervention method. EAZY automatically identifies hallucination-relevant tokens via attention analysis and unsupervised importance estimation, then applies adaptive zero-masking to them. Evaluated across multiple LVLM architectures and benchmark datasets, EAZY consistently mitigates hallucination without compromising original task performance. It also improves unsupervised hallucination detection accuracy by 15%, demonstrating precise, lossless, and generalizable hallucination suppression.
📝 Abstract
Despite their remarkable potential, Large Vision-Language Models (LVLMs) still struggle with object hallucination, a problem in which their generated outputs mistakenly incorporate objects that do not actually exist. Although most prior work addresses this issue within the language-model backbone, our work shifts the focus to the image input, investigating how specific image tokens contribute to hallucinations. Our analysis reveals a striking finding: a small subset of image tokens with high attention scores is the primary driver of object hallucination. Removing these hallucinatory image tokens (only 1.5% of all image tokens) effectively mitigates the issue, and this finding holds consistently across different models and datasets. Building on this insight, we introduce EAZY, a novel, training-free method that automatically identifies and Eliminates hAllucinations by Zeroing out hallucinatorY image tokens. We apply EAZY to unsupervised object hallucination detection, achieving a 15% improvement over previous methods. Additionally, EAZY is remarkably effective at mitigating hallucinations while preserving model utility, and it adapts seamlessly to various LVLM architectures.
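To make the core intervention concrete, here is a minimal sketch of the zero-masking step: rank image tokens by the attention they receive and zero out the top ~1.5%. The function name, the NumPy-array token/attention representations, and the fixed masking ratio are illustrative assumptions; the actual method also relies on unsupervised importance estimation to select tokens adaptively.

```python
import numpy as np

def zero_mask_hallucinatory_tokens(image_tokens, attention_scores, mask_ratio=0.015):
    """Illustrative sketch (not the paper's implementation).

    image_tokens: (num_tokens, dim) array of image token embeddings.
    attention_scores: (num_tokens,) aggregated attention each image token
        receives during generation (assumed precomputed).
    mask_ratio: fraction of tokens to zero out (~1.5% per the paper's finding).
    Returns the masked token array and the indices that were zeroed.
    """
    num_tokens = image_tokens.shape[0]
    k = max(1, int(round(num_tokens * mask_ratio)))
    # Indices of the k highest-attention tokens: candidate hallucination drivers.
    top_idx = np.argsort(attention_scores)[-k:]
    masked = image_tokens.copy()
    masked[top_idx] = 0.0  # zero-masking: tokens stay in place but are zeroed
    return masked, top_idx
```

Because the tokens are zeroed rather than removed, the sequence length and positional layout the LVLM expects are preserved, which is what allows the intervention to remain training-free and architecture-agnostic.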