MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost and redundant gradient signals of existing multi-negative preference optimization methods. Under the Plackett–Luce (PL) preference model, the authors propose an active negative selection strategy that picks a compact subset of diverse, information-complementary negatives for each prompt by maximizing the log-determinant of a PL-specific Fisher information matrix. The approach integrates into the Direct Preference Optimization (DPO) framework and makes policy optimization more efficient. Across four benchmarks (recommendation and multiple-choice QA) and three model families, the method matches or exceeds existing methods in accuracy, improves Recall and NDCG, and achieves stronger alignment with substantially fewer negative samples.

📝 Abstract

Multi-negative preference optimization under the Plackett–Luce (PL) model extends Direct Preference Optimization (DPO) by leveraging comparative signals across one preferred and multiple rejected responses. However, optimizing over large negative pools is costly, and many candidates contribute redundant gradients due to their similar effects on policy updates. We introduce MASS-DPO, a multi-negative active sample selection method that derives a PL-specific Fisher-information objective for selecting compact, informative negative subsets within each prompt. The resulting log-determinant objective selects negatives that contribute complementary information for policy updates, yielding compact subsets that retain the full pool's information while reducing redundancy. In practice, this favors negatives whose gradients cover different update directions, reducing redundant signal from near-duplicate candidates while preserving the most useful training information. Across four benchmarks spanning recommendation and multiple-choice QA and three model families, MASS-DPO consistently exceeds or matches existing methods in accuracy, improves Recall/NDCG and margin-based optimization dynamics, and delivers stronger alignment with substantially fewer negatives.
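
To make the log-determinant selection idea concrete, the following is a minimal sketch of greedy log-det subset selection. It is not the paper's PL-specific Fisher-information objective. It assumes each negative is summarized by a hypothetical feature vector standing in for its gradient, and it approximates the information matrix as a ridge term plus a sum of outer products. The function name `greedy_logdet_select` and the `ridge` parameter are illustrative assumptions.

```python
import numpy as np


def greedy_logdet_select(features, k, ridge=1e-3):
    """Greedily pick k rows maximizing log det(ridge * I + sum_i g_i g_i^T).

    `features` is an (n, d) array; each row is an assumed stand-in for the
    gradient direction contributed by one negative candidate.
    """
    features = np.asarray(features, dtype=float)
    n, d = features.shape
    k = min(k, n)
    a_inv = np.eye(d) / ridge  # inverse of the current information matrix
    selected = []
    remaining = set(range(n))
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in sorted(remaining):
            g = features[i]
            # Matrix determinant lemma: adding g g^T raises log det
            # by log(1 + g^T A^{-1} g).
            gain = np.log1p(g @ a_inv @ g)
            if gain > best_gain:
                best, best_gain = i, gain
        g = features[best]
        ag = a_inv @ g
        # Sherman-Morrison rank-one update of the inverse.
        a_inv -= np.outer(ag, ag) / (1.0 + g @ ag)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy pool: candidate 1 is a near-duplicate of candidate 0,
# candidate 2 points in a different direction.
pool = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 0.8]])
print(greedy_logdet_select(pool, k=2))
```

In this toy pool the near-duplicate candidate adds almost no log-determinant gain once its neighbor has been chosen, so the second pick goes to the candidate covering a different direction. This mirrors the abstract's description of favoring negatives whose gradients cover different update directions.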

Problem

Research questions and friction points this paper is trying to address.

multi-negative preference optimization

Direct Preference Optimization

redundant gradients

sample selection

policy optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-negative preference optimization

Active sample selection

Plackett–Luce model