🤖 AI Summary
This work addresses the challenge that multimodal large language models are prone to overfitting during preference optimization due to imbalanced data difficulty, which limits their ability to suppress hallucinations. To mitigate this issue, the authors propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a framework that requires neither additional training nor new data. DA-DPO leverages a pretrained vision-language model to perform distribution-aware voting through generative and contrastive objectives, enabling efficient estimation of sample difficulty. This difficulty estimate is then used to reweight the optimization process, prioritizing learning from harder examples. Experiments demonstrate that DA-DPO significantly improves multimodal preference optimization across multiple standard benchmarks, enhancing the model's robustness against hallucinations and generalization capability while maintaining computational efficiency.
📝 Abstract
Direct Preference Optimization (DPO) has shown strong potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO approaches often suffer from overfitting due to the difficulty imbalance in preference data. Our analysis shows that MLLMs tend to overemphasize easily distinguishable preference pairs, which hinders fine-grained hallucination suppression and degrades overall performance. To address this issue, we propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process. DA-DPO consists of two main components: (1) Difficulty Estimation leverages pre-trained vision-language models with complementary generative and contrastive objectives, whose outputs are integrated via a distribution-aware voting strategy to produce robust difficulty scores without additional training; and (2) Difficulty-Aware Training reweights preference pairs based on their estimated difficulty, down-weighting easy samples while emphasizing harder ones to alleviate overfitting. This framework enables more effective preference optimization by prioritizing challenging examples, without requiring new data or extra fine-tuning stages. Extensive experiments demonstrate that DA-DPO consistently improves multimodal preference optimization, yielding stronger robustness to hallucinations and better generalization across standard benchmarks, while remaining computationally efficient. The project page is available at https://artanic30.github.io/project_pages/DA-DPO/.
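The reweighting idea in the abstract can be sketched as follows. This is a minimal illustrative sketch only: the standard per-pair DPO loss is weighted by an estimated difficulty score, with easy pairs down-weighted and hard pairs emphasized. The power-curve weighting function and the difficulty scores themselves are assumptions for illustration, not the paper's actual formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logratio_chosen, logratio_rejected, beta=0.1):
    """Standard DPO objective for one preference pair:
    -log sigmoid(beta * (r_chosen - r_rejected)), where each r is the
    log-probability ratio between the policy and the reference model."""
    return -math.log(sigmoid(beta * (logratio_chosen - logratio_rejected)))

def difficulty_weight(difficulty, gamma=2.0):
    """Hypothetical reweighting curve: `difficulty` is a score in [0, 1]
    (1 = hardest). A power curve down-weights easy pairs while keeping
    the hardest pairs at full weight."""
    return difficulty ** gamma

def da_dpo_batch_loss(pairs, beta=0.1, gamma=2.0):
    """pairs: list of (logratio_chosen, logratio_rejected, difficulty)."""
    weighted = [
        difficulty_weight(d, gamma) * dpo_loss(c, r, beta)
        for c, r, d in pairs
    ]
    return sum(weighted) / len(weighted)

# Example batch: one easy pair (clearly separated, low difficulty score)
# and one hard pair (small margin, high difficulty score). The easy pair
# contributes far less to the batch loss under the weighting.
batch = [(2.0, -2.0, 0.1), (0.1, 0.0, 0.9)]
print(da_dpo_batch_loss(batch))
```

In a real training loop the log-ratios would come from the policy and reference models, and the difficulty scores from the paper's generative/contrastive voting stage; here both are stand-in numbers.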