Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key

📅 2025-01-16
🤖 AI Summary
Large vision-language models (LVLMs) frequently generate hallucinated outputs during multimodal reasoning, undermining reliability in vision-language understanding. Method: This paper proposes the On-Policy Alignment (OPA)-DPO framework, which (i) theoretically identifies the KL divergence induced by off-policy preference data as a key cause of DPO performance degradation and enforces strict on-policy consistency between the preference data and the initial policy; and (ii) introduces an expert-feedback-driven response correction mechanism that aligns both the original and expert-revised model outputs. Contribution/Results: Theoretical analysis and empirical evaluation demonstrate that OPA-DPO achieves an additional 13.26% and 5.39% reduction in hallucination rates for LLaVA-1.5-7B on the AMBER and Object-Hal benchmarks, respectively, relative to the previous state-of-the-art trained on 16k samples—while using only 4.8k training samples. This confirms the critical role of on-policy alignment in mitigating LVLM hallucinations.

📝 Abstract
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data aligns on-policy w.r.t the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate the problems, we propose On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples.
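The preference-pair objective the abstract refers to is the standard DPO loss; a minimal sketch is below. This is a generic illustration, not the paper's OPA-DPO implementation, and the function name, toy log-probability values, and `beta` default are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss on summed log-probs of the preferred (chosen)
    and hallucinated (rejected) responses to the same prompt and image.
    `beta` scales the implicit KL penalty against the reference policy."""
    # Implicit reward margin between chosen and rejected responses.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Binary logistic loss: -log(sigmoid(margin)).
    return math.log(1.0 + math.exp(-margin))

# At initialization the policy equals the reference (the on-policy case),
# so the margin is zero and the loss is exactly log 2.
print(dpo_loss(-12.0, -14.0, -12.0, -14.0))  # ≈ 0.6931
```

As the updated policy raises the chosen response's log-probability and lowers the rejected one's relative to the reference, the margin grows and the loss falls below log 2; the paper's point is that this gradient signal degrades when the preference data is off-policy with respect to the reference.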
Problem

Research questions and friction points this paper is trying to address.

Large Vision-Language Models
Complex Image Captioning
Hallucinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

OPA-DPO Framework
Expert Feedback Integration
Vision-Language Model Enhancement
Zhihe Yang
The Chinese University of Hong Kong
Offline RL · RLHF · LLM · LMM
Xufang Luo
Microsoft Research Asia, Shanghai, China
Dongqi Han
Microsoft Research Asia, Shanghai, China
Yunjian Xu
The Chinese University of Hong Kong
Deep reinforcement learning · Power systems · Electricity markets · Operations Research · Stochastic optimal control
Dongsheng Li
Microsoft Research Asia, Shanghai, China