🤖 AI Summary
Large Vision-Language Models (LVLMs) frequently hallucinate when interpreting stylized images, limiting their reliability in high-stakes domains such as art education, medical image analysis, and game scene understanding. This work constructs a dual-modality dataset of paired photographic and stylized images with carefully annotated captions, and benchmarks 13 advanced LVLMs on both discriminative and generative tasks, finding that stylized images induce significantly more hallucinations than their photographic counterparts. To address this, the authors propose Style-Aware Visual Early Revision (SAVER), a mechanism that dynamically adjusts an LVLM's final outputs based on token-level visual attention patterns, leveraging early-layer feedback to mitigate style-induced hallucinations. SAVER achieves state-of-the-art hallucination mitigation across multiple models, datasets, and tasks, enhancing LVLM robustness and trustworthiness in these applications.
📝 Abstract
Large Vision-Language Models (LVLMs) have recently achieved significant breakthroughs in understanding complex visual-textual contexts. However, hallucination issues still limit their real-world applicability. Although previous mitigation methods effectively reduce hallucinations in photographic images, they largely overlook the potential risks posed by stylized images, which play crucial roles in critical scenarios such as game scene understanding, art education, and medical analysis. In this work, we first construct a dataset comprising photographic images and their corresponding stylized versions with carefully annotated caption labels. We then conduct head-to-head comparisons on both discriminative and generative tasks by benchmarking 13 advanced LVLMs on the collected datasets. Our findings reveal that stylized images tend to induce significantly more hallucinations than their photographic counterparts. To address this issue, we propose Style-Aware Visual Early Revision (SAVER), a novel mechanism that dynamically adjusts LVLMs' final outputs based on token-level visual attention patterns, leveraging early-layer feedback to mitigate hallucinations caused by stylized images. Extensive experiments demonstrate that SAVER achieves state-of-the-art performance in hallucination mitigation across various models, datasets, and tasks.
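To make the revision idea concrete, here is a minimal sketch of an early-layer attention feedback step in the spirit of SAVER. It is an illustrative assumption, not the paper's actual algorithm: the function name `saver_revise`, the use of a photographic reference attention map, the KL-divergence interference score, and the penalty weight `alpha` are all hypothetical. The sketch penalizes candidate output tokens whose early-layer attention over visual patches diverges from a style-free reference pattern.

```python
import numpy as np

def saver_revise(logits, early_attn, ref_attn, alpha=1.0):
    """Hedged sketch of a SAVER-style output revision (illustrative only).

    logits:     (num_candidates,) final-layer scores for candidate tokens
    early_attn: (num_candidates, num_patches) early-layer attention of each
                candidate token over visual patches (stylized image)
    ref_attn:   (num_candidates, num_patches) reference attention pattern
                (e.g. from the photographic counterpart) -- an assumption here
    alpha:      penalty weight for style-induced attention divergence
    """
    eps = 1e-9
    # Normalize each token's attention into a distribution over patches
    p = early_attn / early_attn.sum(axis=-1, keepdims=True)
    q = ref_attn / ref_attn.sum(axis=-1, keepdims=True)
    # Per-token KL divergence as a proxy for style-induced interference
    kl = np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)
    # Down-weight tokens whose early visual attention was disrupted by style
    return logits - alpha * kl

# Usage: token 0 attends like the reference; token 1's attention collapses
# onto one patch, so its score is revised downward before decoding.
logits = np.array([3.0, 3.0])
early = np.array([[0.25, 0.25, 0.25, 0.25],
                  [0.97, 0.01, 0.01, 0.01]])
ref = np.full((2, 4), 0.25)
revised = saver_revise(logits, early, ref)
```

Under these assumptions, a token whose early attention matches the reference keeps its score, while a style-disrupted token is suppressed; the real method operates inside the LVLM's transformer stack rather than on standalone arrays.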