Mitigating Hallucinations in Large Vision-Language Models by Self-Injecting Hallucinations

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive hallucinations, where generated text is inconsistent with the visual input. Existing mitigation approaches rely on costly human annotations or auxiliary models to construct preference data, limiting scalability and practicality. This paper proposes APASI, a self-injected, supervision-free hallucination mitigation framework. Its core innovation is leveraging the LVLM itself to generate negative samples that conform to authentic hallucination patterns, thereby constructing preference pairs without external supervision; it further employs iterative, curriculum-aligned training to progressively strengthen the model's visual consistency. Evaluated across six benchmarks, APASI significantly reduces hallucination rates for three representative LVLMs, achieving performance on par with, or even surpassing, supervised alignment methods. These results demonstrate APASI's effectiveness, generalizability, and potential for sustained, self-contained improvement.

📝 Abstract
Large Vision-Language Models (LVLMs) suffer from serious hallucination problems, where the model-generated responses are inconsistent with the visual inputs. Existing hallucination mitigation methods are mainly based on preference alignment and require external human annotations or auxiliary models for preference data collection, which increase costs and limit sustainable improvement. To tackle these challenges, we propose Autonomous Preference Alignment via Self-Injection (APASI), a novel and generalizable method that mitigates hallucinations without external dependencies. APASI leverages the target LVLM to self-inject hallucinations into a generated response, creating a pair of responses with varying preference levels. During the self-injection process, the dis-preferred response is generated based on three key observations of hallucinations, ensuring it simulates real hallucination patterns. This fidelity offers an accurate learning signal for hallucination mitigation. Moreover, APASI incorporates an iterative alignment training strategy combined with curriculum learning to periodically update the preference data with increasing challenge, enabling stable and continuous enhancement of the LVLM. Extensive experiments across six benchmarks show that APASI not only effectively mitigates hallucinations for three baseline models but also achieves comparable or even superior performance to alignment-based methods with external dependency, thereby demonstrating its effectiveness and generalization capability. The code is available at https://github.com/davidluciolu/APASI.
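The abstract's core step — using the model itself to turn a faithful response into a dis-preferred one — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the helper names (`inject_hallucination`, `build_pair`) and the specific injection pattern (appending a non-existent object, one of the most common hallucination types) are assumptions loosely inspired by the "three key observations" the abstract mentions.

```python
# Hedged sketch of APASI-style preference-pair construction via self-injection.
# All names and the injection pattern are illustrative, not from the paper.
from dataclasses import dataclass
import random


@dataclass
class PreferencePair:
    image_id: str
    preferred: str      # the model's original (faithful) description
    dispreferred: str   # the same description with a self-injected hallucination


def inject_hallucination(response: str, distractors: list[str],
                         rng: random.Random) -> str:
    """Append a sentence mentioning an object that is NOT in the image,
    imitating the common 'non-existent object' hallucination pattern."""
    fake = rng.choice(distractors)
    return response + f" There is also a {fake} in the scene."


def build_pair(image_id: str, response: str, distractors: list[str],
               seed: int = 0) -> PreferencePair:
    rng = random.Random(seed)
    return PreferencePair(
        image_id=image_id,
        preferred=response,
        dispreferred=inject_hallucination(response, distractors, rng),
    )


pair = build_pair("coco_42", "A dog sleeps on a couch.",
                  ["frisbee", "umbrella", "laptop"])
print(pair.preferred)
print(pair.dispreferred)
```

In the actual method, the injected content would come from the LVLM's own generations rather than a fixed distractor list, which is what makes the negatives match real hallucination patterns.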
Problem

Research questions and friction points this paper is trying to address.

Mitigating hallucinations in vision-language models
Reducing dependency on external annotation resources
Self-generating preference data for alignment training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-injecting hallucinations for autonomous preference alignment
Iterative alignment training with curriculum learning strategy
Generating dis-preferred responses using key hallucination observations
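The self-injected pairs above are meant to feed a standard preference-alignment objective. As a hedged illustration only — the paper's exact objective is not specified on this page, and DPO is merely a common choice for this role — a single-pair DPO loss looks like:

```python
# Hedged sketch: Direct Preference Optimization (DPO) loss for one
# (preferred, dis-preferred) pair. Assumption: APASI-style pairs are plugged
# into a DPO-like objective; the paper's actual objective may differ.
import math


def dpo_loss(logp_pref: float, logp_dispref: float,
             ref_logp_pref: float, ref_logp_dispref: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin), where the margin compares how much the
    policy prefers the faithful response over the hallucinated one,
    relative to a frozen reference model."""
    margin = (logp_pref - ref_logp_pref) - (logp_dispref - ref_logp_dispref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))


# A policy that already favors the faithful response (margin > 0)
# incurs a lower loss than one that is indifferent (margin = 0, loss = ln 2).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
print(dpo_loss(-11.0, -11.0, -11.0, -11.0))
```

The curriculum aspect would then control how hard the injected negatives are in each round, regenerating the preference data with increasing challenge as the model improves.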
Yifan Lu
Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information, CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences
Ziqi Zhang
Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information, CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences
Chunfeng Yuan
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Machine Learning, Human Action Recognition, Sparse Representation
Jun Gao
Hello Group
Congxuan Zhang
Nanchang Hangkong University
Xiaojuan Qi
Assistant Professor, The University of Hong Kong
3D Vision, Deep Learning, Artificial Intelligence, Medical Image Analysis
Bing Li
Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information, CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences
Weiming Hu
Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information, CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Information Science and Technology, ShanghaiTech University