Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals

πŸ“… 2025-08-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address pervasive noise and conflicts in fine-grained, multi-dimensional preference data, this paper proposes a novel data selection paradigm grounded in Preference Divergence (PD), the first theoretically grounded metric quantifying multi-dimensional preference conflicts. Leveraging PD, the authors formulate an optimal data selection principle that enhances robustness and efficiency at the data curation stage. Methodologically, PD estimation is integrated into the Direct Multi-Preference Optimization (DMPO) framework while jointly mitigating length bias, enabling efficient filtering and training without requiring full preference annotations. Evaluated on the UltraFeedback dataset, the approach achieves over 10% relative performance improvement compared to strong baselines. It effectively unlocks the value of fine-grained preference data, offering an interpretable, scalable, data-driven pathway for value alignment.

πŸ“ Abstract
Aligning Large Language Models (LLMs) with diverse human values requires moving beyond a single holistic "better-than" preference criterion. While collecting fine-grained, aspect-specific preference data is more reliable and scalable, existing methods like Direct Preference Optimization (DPO) struggle with the severe noise and conflicts inherent in such aggregated datasets. In this paper, we tackle this challenge from a data-centric perspective. We first derive the Direct Multi-Preference Optimization (DMPO) objective and uncover a key Preference Divergence (PD) term that quantifies inter-aspect preference conflicts. Instead of using this term for direct optimization, we leverage it to formulate a novel, theoretically grounded data selection principle. Our principle advocates selecting a subset of high-consensus data, identified by the most negative PD values, for efficient DPO training. We prove the optimality of this strategy by analyzing the loss bounds of the DMPO objective in the selection problem. To operationalize our approach, we introduce practical methods for PD estimation and length-bias mitigation, yielding our PD selection method. Evaluation on the UltraFeedback dataset at three conflict levels shows that our simple yet effective strategy achieves over 10% relative improvement against both the standard holistic preference baseline and a stronger oracle using aggregated preference signals, all while boosting training efficiency and obviating the need for intractable holistic preference annotation, unlocking the potential of robust LLM alignment via fine-grained preference signals.
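The selected high-consensus pairs are then used for standard DPO training. For background only, the generic per-pair DPO loss (Rafailov et al., 2023; not this paper's DMPO derivation) can be sketched as follows, where the log-probabilities and `beta` value are illustrative placeholders:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a summed token log-probability of a full response
    under the policy or the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(m) == log(1 + exp(-m)); log1p keeps this numerically stable.
    return math.log1p(math.exp(-margin))

# Hypothetical log-probabilities: the policy has moved toward the chosen
# response (relative to the reference), so the loss is below log(2) ~ 0.693.
print(dpo_loss(-10.0, -9.0, -12.0, -8.0))
```

The loss shrinks as the policy's implicit reward margin between chosen and rejected responses grows, which is why noisy or conflicting pair labels directly degrade training, motivating the selection step above.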
Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with diverse human values beyond single preference criteria
Addressing noise and conflicts in fine-grained aspect-specific preference data
Improving training efficiency and robustness via high-consensus data selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct Multi-Preference Optimization objective derivation
Preference Divergence term for data selection
High-consensus data subset selection strategy
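The paper's exact PD formula is not reproduced on this page. As an illustrative sketch only, high-consensus selection could be approximated by scoring disagreement among hypothetical per-aspect reward margins (chosen minus rejected, as in UltraFeedback's aspect ratings) and keeping the pairs with the most negative scores:

```python
import numpy as np

def pd_score(aspect_margins):
    """Toy Preference Divergence proxy (NOT the paper's PD term).

    aspect_margins: per-aspect score differences (chosen - rejected).
    The score rises with inter-aspect disagreement and falls as the
    average margin strengthens, so high consensus yields negative values.
    """
    signs = np.sign(aspect_margins)
    majority = np.sign(signs.sum() or 1)          # tie-break toward +1
    disagreement = np.mean(signs != majority)     # fraction of dissenting aspects
    return disagreement - abs(np.mean(aspect_margins))

def select_high_consensus(dataset_margins, keep_frac=0.5):
    """Keep the examples with the most negative PD-proxy values."""
    scores = np.array([pd_score(np.asarray(m)) for m in dataset_margins])
    k = max(1, int(len(scores) * keep_frac))
    return np.argsort(scores)[:k]

# Hypothetical margins over 4 aspects for three preference pairs:
margins = [
    [0.8, 0.6, 0.9, 0.7],    # strong consensus: chosen wins on every aspect
    [0.5, -0.4, 0.3, -0.6],  # conflicting aspect signals
    [-0.2, -0.3, -0.1, -0.4],
]
print(select_high_consensus(margins, keep_frac=0.34))  # → [0]
```

Only the fully consistent pair survives the 34% cut, mirroring the paper's idea that training on the most negative-PD (highest-consensus) subset is both more robust and cheaper than training on all aggregated pairs.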
Jia Zhang
National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University
Yao Liu
Algorithm Tech, Taobao & Tmall Group of Alibaba
Chen-Xi Zhang
National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University
Yi Liu
Algorithm Tech, Taobao & Tmall Group of Alibaba
Yi-Xuan Jin
Nanjing University
Lan-Zhe Guo
LAMDA Group, Nanjing University
Yu-Feng Li
Professor, Nanjing University