🤖 AI Summary
This work identifies the visual encoder, rather than the language model, as the root cause of object hallucination in large vision-language models (LVLMs), and is the first to systematically characterize three of its deficiencies: statistical bias, inherent bias, and vulnerability. To address these, the authors propose SHIELD, a lightweight, training-free defense framework that applies three decoupled interventions: (i) visual token re-weighting to mitigate statistical bias; (ii) noise-derived token injection to suppress inherent bias; and (iii) adversarial perturbation combined with contrastive decoding to improve robustness. Evaluated across diverse LVLM architectures (e.g., LLaVA, Qwen-VL) and standard hallucination benchmarks (POPE, MME-Hallucination), SHIELD reduces hallucination rates by an average of 21.3% while preserving, and in some cases improving, performance on downstream tasks. Crucially, it generalizes across models and requires no architectural modification or parameter update.
📝 Abstract
Large Vision-Language Models (LVLMs) excel at diverse cross-modal tasks. However, object hallucination, where models produce plausible but inaccurate object descriptions, remains a significant challenge. In contrast to previous work focusing on the LLM component, this paper is the first to trace LVLM hallucinations to the visual encoder, identifying three key issues: statistical bias, inherent bias, and vulnerability. To address these challenges, we propose SHIELD, a training-free framework that mitigates hallucinations through three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and applying adversarial attacks with contrastive decoding to address vulnerability. Experiments demonstrate that SHIELD effectively mitigates object hallucination across diverse benchmarks and LVLM families. Moreover, SHIELD also performs strongly on a general LVLM benchmark, highlighting its broad applicability. Code will be released.
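The third strategy pairs an adversarial perturbation of the input image with contrastive decoding. A minimal sketch of one such decoding step is below, assuming a VCD-style logit combination in which tokens that remain likely even when visual evidence is degraded (i.e., tokens driven by language priors) are down-weighted; the exact formula, the `alpha` coefficient, and the function names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def contrastive_decode_step(logits_clean, logits_perturbed, alpha=1.0):
    """Combine next-token logits from the clean image and a perturbed image.

    Assumed VCD-style rule: (1 + alpha) * clean - alpha * perturbed.
    A token whose logit survives perturbation is likely prior-driven
    (a hallucination candidate), so the subtraction suppresses it.
    """
    clean = np.asarray(logits_clean, dtype=float)
    perturbed = np.asarray(logits_perturbed, dtype=float)
    return (1.0 + alpha) * clean - alpha * perturbed

# Toy 4-token vocabulary: token 1's score barely changes under
# perturbation, suggesting it is not grounded in the image.
clean = np.array([2.0, 1.0, 0.5, 0.1])
perturbed = np.array([0.5, 1.2, 0.5, 0.1])
combined = contrastive_decode_step(clean, perturbed)
next_token = int(np.argmax(combined))  # token 0: visually grounded
```

In a full pipeline, `logits_perturbed` would come from a second forward pass of the LVLM on the adversarially perturbed image, and the combined logits would feed the usual sampling or greedy-decoding loop.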