On the Role of Preference Variance in Preference Optimization

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high cost of human preference annotation in Direct Preference Optimization (DPO), this paper introduces Preference Variance (PVar), a metric that quantifies the variance of a model's preference when comparing pairs of responses to a prompt. The authors establish an upper bound on the DPO gradient norm of a prompt in terms of its PVar, showing that low-PVar prompts can only produce small gradient updates and therefore carry little learning signal. Building on this insight, they propose selecting high-PVar prompts, scored with a reward model, for DPO fine-tuning. In experiments with reward-model-generated preferences, high-PVar prompts outperform randomly selected and low-PVar prompts on AlpacaEval 2.0 and Arena-Hard, and the selection remains robust even with small (1B, 3B) reward models. In a separate experiment using the original human annotations from UltraFeedback, training on only the top 10% highest-PVar prompts yields better results than training on the full dataset. Together, the results provide theoretical grounding and a practical recipe for efficient LLM alignment with substantially reduced annotation overhead.
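
For orientation, a minimal sketch of the quantities involved, written in standard DPO notation; the definition of PVar and the form of the bound below are an assumed, informal rendering rather than the paper's exact statement.

    % Standard per-example DPO gradient (Rafailov et al., 2023), with the implicit reward
    % \hat{r}_\theta(x,y) = \beta \log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}:
    \nabla_\theta \mathcal{L}_{\mathrm{DPO}}(x, y_w, y_l)
      = -\beta\,\sigma\!\bigl(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\bigr)
        \bigl[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\bigr]

    % Assumed Bradley-Terry-style preference variance of a prompt x: the variance of the
    % preference probability over response pairs sampled for that prompt.
    \mathrm{PVar}(x) \;=\; \mathrm{Var}_{(y, y')}\bigl[\Pr(y \succ y' \mid x)\bigr]

    % Informal reading of the paper's insight: the norm of the expected DPO gradient at
    % prompt x is upper-bounded by an increasing function g of PVar(x), so low-PVar
    % prompts can only contribute small updates (see the paper for the exact statement).
    \bigl\|\mathbb{E}_{(y_w, y_l)}\,\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(x, y_w, y_l)\bigr\|
      \;\lesssim\; g\bigl(\mathrm{PVar}(x)\bigr)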

📝 Abstract
Direct Preference Optimization (DPO) has emerged as an important approach for learning from human preferences in aligning large language models (LLMs). However, collecting human preference data is costly and inefficient, motivating methods to reduce the required annotations. In this work, we investigate the impact of preference variance (PVar), which measures the variance in model preferences when comparing pairs of responses, on the effectiveness of DPO training. We provide a theoretical insight by establishing an upper bound on the DPO gradient norm for any given prompt, showing it is controlled by the PVar of that prompt. This implies that prompts with low PVar can only produce small gradient updates, making them less valuable for learning. We validate this finding by fine-tuning LLMs with preferences generated by a reward model, evaluating on two benchmarks (AlpacaEval 2.0 and Arena-Hard). Experimental results demonstrate that prompts with higher PVar outperform randomly selected prompts or those with lower PVar. We also show that our PVar-based selection method remains robust when using smaller reward models (1B, 3B) for selection. Notably, in a separate experiment using the original human annotations from the UltraFeedback dataset, we found that training on only the top 10% of prompts with the highest PVar yields better evaluation performance than training on the full dataset, highlighting the importance of preference variance in identifying informative examples for efficient LLM alignment.
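
The selection procedure is not spelled out on this page, but one plausible implementation is: sample several responses per prompt, score them with a reward model, convert each score gap into a Bradley-Terry preference probability, and rank prompts by the variance of those probabilities. The sketch below follows that reading; the estimator, the helper names (pvar, select_top_pvar, reward_fn), and the pairwise-sigmoid preferences are illustrative assumptions rather than the paper's exact procedure.

    from itertools import combinations
    from typing import Callable, Dict, List
    import math
    import statistics

    def pvar(prompt: str,
             responses: List[str],
             reward_fn: Callable[[str, str], float]) -> float:
        """Estimate preference variance (PVar) for a single prompt.

        Assumed estimator: score each sampled response with a reward model,
        turn every pair of scores into a Bradley-Terry preference probability,
        and return the variance of those probabilities across pairs. A low
        value means the model's preference is nearly identical for every pair.
        """
        scores = [reward_fn(prompt, r) for r in responses]
        prefs = [
            1.0 / (1.0 + math.exp(scores[j] - scores[i]))  # P(response i preferred over j)
            for i, j in combinations(range(len(scores)), 2)
        ]
        return statistics.pvariance(prefs)

    def select_top_pvar(prompt_to_responses: Dict[str, List[str]],
                        reward_fn: Callable[[str, str], float],
                        fraction: float = 0.10) -> List[str]:
        """Rank prompts by PVar and keep the top fraction (10% in the UltraFeedback experiment)."""
        ranked = sorted(prompt_to_responses,
                        key=lambda p: pvar(p, prompt_to_responses[p], reward_fn),
                        reverse=True)
        k = max(1, int(len(ranked) * fraction))
        return ranked[:k]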
Problem

Research questions and friction points this paper is trying to address.

Investigating how preference variance affects Direct Preference Optimization training effectiveness
Establishing theoretical bounds showing that low-variance prompts produce small gradient updates
Demonstrating that high-variance prompts yield better performance than random selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing the impact of preference variance on DPO training
Establishing a theoretical upper bound on the DPO gradient norm
Selecting high-variance prompts for efficient alignment (see the usage sketch below)
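
A usage sketch, assuming the pvar/select_top_pvar helpers above and a toy preference dataset with plain-text prompt/chosen/rejected columns; the dataset layout and the placeholder reward model are illustrative, not details from the paper.

    from datasets import Dataset

    # Toy preference data in the prompt/chosen/rejected layout commonly used for DPO.
    prefs = Dataset.from_dict({
        "prompt":   ["How do I reverse a list in Python?", "Explain beam search briefly."],
        "chosen":   ["Use slicing: xs[::-1] returns a reversed copy.",
                     "Beam search keeps the k best partial hypotheses at each step."],
        "rejected": ["You cannot reverse lists in Python.",
                     "Beam search is a sorting algorithm."],
    })

    # Pool the available responses per prompt (here just the chosen/rejected pair per row).
    by_prompt = {}
    for ex in prefs:
        by_prompt.setdefault(ex["prompt"], []).extend([ex["chosen"], ex["rejected"]])

    def toy_reward_model(prompt: str, response: str) -> float:
        # Placeholder: in practice, score (prompt, response) with a trained reward model.
        return float(len(response))

    keep = set(select_top_pvar(by_prompt, reward_fn=toy_reward_model, fraction=0.10))
    dpo_subset = prefs.filter(lambda ex: ex["prompt"] in keep)
    # dpo_subset is then passed to a standard DPO training loop in place of the full dataset.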
Jiacheng Guo
Department of Electrical & Computer Engineering, Princeton University
Zihao Li
Department of Electrical & Computer Engineering, Princeton University
Jiahao Qiu
Princeton University
LLM · AI Agents · AI for X
Yue Wu
AI Lab, Princeton University
Mengdi Wang
Department of Electrical & Computer Engineering, Princeton University