🤖 AI Summary
Problem: Large vision-language models (LVLMs) frequently generate hallucinated outputs during multimodal reasoning, undermining reliability in vision-language understanding.
Method: This paper proposes the On-Policy Alignment (OPA)-DPO framework, which (i) shows theoretically that the KL divergence between the updated policy and the reference policy impedes learning from off-policy preference data, and therefore keeps the preference data on-policy with respect to the initial (reference) policy; and (ii) uses expert feedback to correct hallucinated responses, then aligns both the original and the expert-revised responses in an on-policy manner.
Contribution/Results: With only 4.8k training samples, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B of 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous state-of-the-art method trained on 16k samples. The results support the importance of on-policy alignment in mitigating LVLM hallucinations.
📝 Abstract
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It learns directly from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, the different data construction methods used in existing works lead to notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data is on-policy with respect to the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of the KL divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate these problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k training samples, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B of 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples.
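To make the role of the reference policy concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not the OPA-DPO implementation; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions. The inputs are the summed token log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the policy being trained and under the frozen reference policy.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is a sequence-level log-probability, log pi(y | x, image).
    `beta` scales how strongly the policy is tied to the reference policy;
    the default here is an assumption, not a value taken from the paper.
    """
    # Implicit rewards: how far the policy has moved from the reference
    # on each response, measured in log-probability.
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_shift - rejected_shift)

    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# At initialization the policy equals the reference, so the margin is zero
# and the loss is log(2) regardless of how likely the responses are.
initial_loss = dpo_loss(-5.0, -7.0, -5.0, -7.0)
```

Because the loss depends only on log-ratios against the reference policy, the reference log-probabilities of the chosen and rejected responses set the starting point for training. The abstract's argument is that when those responses are off-policy with respect to the reference policy, the KL divergence between the updated and reference policies impedes learning from them.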