🤖 AI Summary
Existing DPO methods rely on a single negative sample, typically generated via simple perturbations or similarity-based retrieval. This fails to capture the semantic complexity of multimodal preferences and often induces optimization bias and hallucination. To address this, we propose MISP-DPO, the first DPO framework for vision-language models to incorporate semantically diverse multi-negative contrastive learning. Our approach leverages CLIP embeddings and a sparse autoencoder to identify semantic deviation dimensions, then selects negatives jointly by reconstruction difficulty, positive-negative semantic divergence, and inter-negative diversity. Multi-negative comparisons are handled with a Plackett-Luce ranking model, and an importance sampling strategy keeps training efficient. Evaluated on five benchmarks, MISP-DPO achieves significant improvements over state-of-the-art methods, demonstrating that semantic-aware multi-negative sampling is critical for robust and accurate multimodal preference alignment.
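The three selection criteria above (reconstruction difficulty, divergence from the positive, and inter-negative diversity) can be illustrated with a greedy scoring sketch over CLIP embeddings. The weights `w_err`/`w_dev`/`w_div`, the cosine-distance measure, and the greedy scheme itself are assumptions for illustration, not the paper's exact selection rule:

```python
import numpy as np

def select_negatives(cand_emb, pos_emb, recon_err, k=3,
                     w_err=1.0, w_dev=1.0, w_div=1.0):
    """Greedy multi-negative selection (illustrative sketch only).

    cand_emb : (N, D) CLIP embeddings of candidate negative images
    pos_emb  : (D,)   CLIP embedding of the positive image
    recon_err: (N,)   sparse-autoencoder reconstruction error per candidate
    """
    def cos_dist(a, b):
        # 1 - cosine similarity, with a small epsilon for stability
        return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    chosen, remaining = [], list(range(len(cand_emb)))
    while remaining and len(chosen) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            dev = cos_dist(cand_emb[i], pos_emb)          # deviation from positive
            div = min((cos_dist(cand_emb[i], cand_emb[j]) # diversity vs. chosen
                       for j in chosen), default=1.0)
            score = w_err * recon_err[i] + w_dev * dev + w_div * div
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

The greedy loop trades off all three terms at each step, so later picks are pushed away from both the positive and the negatives already selected.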
📝 Abstract
Direct Preference Optimization (DPO) has recently been extended from text-only models to vision-language models. However, existing methods rely on oversimplified pairwise comparisons, generating a single negative image via basic perturbations or similarity-based retrieval; this fails to capture the complex nature of multimodal preferences and induces optimization bias and hallucinations. To address this issue, we propose MISP-DPO, the first framework to incorporate multiple, semantically diverse negative images in multimodal DPO via the Plackett-Luce model. Our method embeds prompts and candidate images in CLIP (Contrastive Language-Image Pretraining) space and applies a sparse autoencoder to decompose semantic deviations into interpretable factors. Negative samples are selected based on reconstruction difficulty, semantic deviation from the positive, and mutual diversity, yielding broader and more informative supervision. To handle multi-negative comparisons, we adopt a Plackett-Luce objective and introduce an importance sampling strategy that improves training efficiency. Experiments across five diverse benchmarks demonstrate that MISP-DPO consistently improves multimodal alignment over prior methods, validating the effectiveness of semantic-aware, multi-negative sampling in preference-based learning.
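A Plackett-Luce objective over one positive and several negatives can be sketched as follows. The sketch assumes the implicit DPO reward per candidate, r = β·log(π_θ/π_ref), has already been computed and is passed in ranked order (positive first); it is a minimal illustration, not the paper's implementation, and omits the importance sampling step:

```python
import math

def plackett_luce_nll(rewards):
    """Negative log-likelihood of the ranking rewards[0] > rewards[1] > ...
    under the Plackett-Luce model.

    rewards: implicit DPO rewards, one per candidate image, with the
    positive at index 0 followed by the negatives.
    """
    nll = 0.0
    for k in range(len(rewards) - 1):  # the final factor is always 1
        # probability that candidate k is ranked first among rewards[k:]
        denom = sum(math.exp(r) for r in rewards[k:])
        nll -= rewards[k] - math.log(denom)
    return nll
```

With two candidates this reduces to the standard pairwise DPO logistic loss; each additional negative adds one softmax factor, which is where multi-negative supervision enters the objective.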