🤖 AI Summary
Current multimodal large language models (MLLMs) respond unevenly to easy versus hard samples during preference alignment, overfitting on easily distinguishable instances while underfitting on challenging ones. To address this, we propose a dynamic, joint optimization framework that simultaneously perceives data hardness and model responses, introducing the first DPO variant driven jointly by *data hardness awareness* and *model response awareness*. Our method quantifies image-text matching difficulty and integrates model output confidence to construct a difficulty-adaptive dynamic weighting mechanism, enabling fine-grained alignment. Evaluated across five benchmarks, our approach significantly enhances reliability and generalization: on Object HalBench, DAMO-7B reduces response-level and mention-level hallucinations by 90.0% and 95.3%, respectively, outperforming GPT-4V.
📝 Abstract
Direct Preference Optimization (DPO) has shown effectiveness in aligning multimodal large language models (MLLMs) with human preferences. However, existing methods exhibit an imbalanced responsiveness to data of varying hardness, tending to overfit easy-to-distinguish data while underfitting hard-to-distinguish data. In this paper, we propose Data- and Model-aware DPO (DAMO) to dynamically adjust the optimization process from two key aspects: (1) a data-aware strategy that incorporates data hardness, and (2) a model-aware strategy that integrates real-time model responses. By combining the two strategies, DAMO enables the model to effectively adapt to data with varying levels of hardness. Extensive experiments on five benchmarks demonstrate that DAMO not only significantly enhances trustworthiness but also improves effectiveness on general tasks. For instance, on Object HalBench, our DAMO-7B reduces response-level and mention-level hallucinations by 90.0% and 95.3%, respectively, surpassing the performance of GPT-4V.
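The difficulty-adaptive weighting described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's exact formulation: the `hardness` input stands in for an offline data-hardness score (e.g. from image-text matching similarity), the model-aware signal is taken from the implicit DPO reward margin, and the averaging form of the weight is invented for readability.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def weighted_dpo_loss(logp_chosen: float, logp_rejected: float,
                      ref_logp_chosen: float, ref_logp_rejected: float,
                      hardness: float, beta: float = 0.1) -> float:
    """Hedged sketch of a DAMO-style, difficulty-adaptive DPO loss.

    `hardness` in [0, 1] is an assumed per-sample data-hardness score;
    the model-aware term reuses the implicit reward margin as a
    real-time confidence signal.
    """
    # Implicit reward margin (standard DPO term).
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Model-aware confidence: a large margin means the pair is
    # already easy for the current model.
    confidence = sigmoid(margin)
    # Dynamic weight (assumed form): up-weight hard data and
    # low-confidence pairs, down-weight easy, well-separated ones.
    weight = (hardness + (1.0 - confidence)) / 2.0
    # Standard DPO logistic loss, scaled by the dynamic weight.
    return weight * (-math.log(sigmoid(margin)))

# An easy, well-separated pair receives a smaller weighted loss
# than a hard, barely-separated one.
easy = weighted_dpo_loss(-5.0, -9.0, -6.0, -8.0, hardness=0.1)
hard = weighted_dpo_loss(-7.0, -7.2, -7.0, -7.1, hardness=0.9)
```

In this toy form, the weight shrinks for samples the model already separates confidently, which mirrors the stated goal of spending less optimization effort on easy-to-distinguish data and more on hard data.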