🤖 AI Summary
Existing DPO methods rely on a single negative sample, typically generated via simple perturbations or similarity-based retrieval. This fails to capture the semantic complexity of multimodal preferences and often induces optimization bias and hallucination. To address this, we propose MISP-DPO, the first DPO framework for vision-language models to incorporate semantically diverse multi-negative contrastive learning. Our approach leverages CLIP embeddings and a sparse autoencoder to identify semantic deviation dimensions, then selects negatives jointly by reconstruction difficulty, positive-negative semantic divergence, and inter-negative diversity. Multi-negative comparisons are handled with a Plackett-Luce ranking model, and an importance sampling strategy keeps training efficient. Evaluated on five benchmarks, MISP-DPO achieves significant improvements over state-of-the-art methods, demonstrating that semantic-aware multi-negative sampling is critical for robust and accurate multimodal preference alignment.
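The three selection criteria above (reconstruction difficulty, divergence from the positive, and inter-negative diversity) can be illustrated with a greedy scoring sketch over CLIP embeddings. The weights `w_err`/`w_dev`/`w_div`, the cosine-distance measure, and the greedy scheme itself are assumptions for illustration, not the paper's exact selection rule:

```python
import numpy as np

def select_negatives(cand_emb, pos_emb, recon_err, k=3,
                     w_err=1.0, w_dev=1.0, w_div=1.0):
    """Greedy multi-negative selection (illustrative sketch only).

    cand_emb : (N, D) CLIP embeddings of candidate negative images
    pos_emb  : (D,)   CLIP embedding of the positive image
    recon_err: (N,)   sparse-autoencoder reconstruction error per candidate
    """
    def cos_dist(a, b):
        # 1 - cosine similarity, with a small epsilon for stability
        return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    chosen, remaining = [], list(range(len(cand_emb)))
    while remaining and len(chosen) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            dev = cos_dist(cand_emb[i], pos_emb)          # deviation from positive
            div = min((cos_dist(cand_emb[i], cand_emb[j]) # diversity vs. chosen
                       for j in chosen), default=1.0)
            score = w_err * recon_err[i] + w_dev * dev + w_div * div
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

The greedy loop trades off all three terms at each step, so later picks are pushed away from both the positive and the negatives already selected.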
📝 Abstract
Direct Preference Optimization (DPO) has recently been extended from text-only models to vision-language models. However, existing methods rely on oversimplified pairwise comparisons, generating a single negative image via basic perturbations or similarity-based retrieval; this fails to capture the complex nature of multimodal preferences and induces optimization bias and hallucinations. To address this issue, we propose MISP-DPO, the first framework to incorporate multiple, semantically diverse negative images in multimodal DPO via the Plackett-Luce model. Our method embeds prompts and candidate images in CLIP (Contrastive Language-Image Pretraining) space and applies a sparse autoencoder to decompose semantic deviations into interpretable factors. Negative samples are selected based on reconstruction difficulty, semantic deviation from the positive, and mutual diversity, yielding broader and more informative supervision. To handle multi-negative comparisons, we adopt a Plackett-Luce objective and introduce an importance sampling strategy that improves training efficiency. Experiments across five diverse benchmarks demonstrate that MISP-DPO consistently improves multimodal alignment over prior methods, validating the effectiveness of semantic-aware, multi-negative sampling in preference-based learning.
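A Plackett-Luce objective over one positive and several negatives can be sketched as follows. The sketch assumes the implicit DPO reward per candidate, r = β·log(π_θ/π_ref), has already been computed and is passed in ranked order (positive first); it is a minimal illustration, not the paper's implementation, and omits the importance sampling step:

```python
import math

def plackett_luce_nll(rewards):
    """Negative log-likelihood of the ranking rewards[0] > rewards[1] > ...
    under the Plackett-Luce model.

    rewards: implicit DPO rewards, one per candidate image, with the
    positive at index 0 followed by the negatives.
    """
    nll = 0.0
    for k in range(len(rewards) - 1):  # the final factor is always 1
        # probability that candidate k is ranked first among rewards[k:]
        denom = sum(math.exp(r) for r in rewards[k:])
        nll -= rewards[k] - math.log(denom)
    return nll
```

With two candidates this reduces to the standard pairwise DPO logistic loss; each additional negative adds one softmax factor, which is where multi-negative supervision enters the objective.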