🤖 AI Summary
DPO neglects the absolute magnitude of reward signals, which can decrease the selection probability of preferred responses and increase the risk of out-of-distribution generation, a phenomenon termed "Degraded Chosen Responses (DCR)." To address this, the authors propose Balanced Preference Optimization (BPO), a framework that dynamically balances optimization intensity between chosen and rejected responses without altering the loss structure or imposing auxiliary constraints. BPO introduces two key components, a balanced reward margin and a gap adaptor, which enable dynamic reward scaling and gradient modulation, and it requires only a single-line code modification. On mathematical reasoning benchmarks, BPO significantly outperforms DPO: accuracy improves by 10.1 points on Llama-3.1-8B-Instruct (18.8% → 28.9%) and 11.7 points on Qwen2.5-Math-7B (35.0% → 46.7%), consistently surpassing strong DPO variants including IPO, SLiC, and Cal-DPO.
📝 Abstract
Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through a pairwise ranking loss, it neglects absolute reward magnitudes. This oversight can decrease the likelihood of chosen responses and increase the risk of generating out-of-distribution responses, leading to poor performance. We term this issue Degraded Chosen Responses (DCR). To address it, we propose Balanced Preference Optimization (BPO), a novel framework that dynamically balances the optimization of chosen and rejected responses through two key components: a balanced reward margin and a gap adaptor. Unlike previous methods, BPO fundamentally resolves DPO's DCR issue without introducing additional constraints into the loss function. Experimental results on multiple mathematical reasoning tasks show that BPO significantly outperforms DPO, improving accuracy by +10.1% with Llama-3.1-8B-Instruct (18.8% to 28.9%) and +11.7% with Qwen2.5-Math-7B (35.0% to 46.7%). It also surpasses DPO variants by +3.6% over IPO (43.1%), +5.0% over SLiC (41.7%), and +3.1% over Cal-DPO (43.6%) on the same model. Remarkably, our algorithm requires only a single line of code modification, making it simple to implement and fully compatible with existing DPO-based frameworks.
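For context, the objective that BPO modifies is the standard DPO pairwise loss, in which only the *difference* between the implicit rewards of the chosen and rejected responses matters. The minimal sketch below illustrates that standard loss for a single preference pair (function and variable names are illustrative; the abstract does not give BPO's exact margin or adaptor formulas, so they are not reproduced here):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO pairwise loss for one preference pair.

    Inputs are the summed log-probabilities of the chosen/rejected
    responses under the policy (pi_*) and reference (ref_*) models.
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model.
    r_chosen = beta * (pi_chosen - ref_chosen)
    r_rejected = beta * (pi_rejected - ref_rejected)
    # Only the margin r_chosen - r_rejected enters the loss, so the absolute
    # magnitudes of the two rewards are unconstrained -- the property the
    # abstract identifies as the source of DCR (the chosen response's
    # likelihood can fall as long as the rejected one's falls faster).
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

With equal log-ratios for both responses the margin is zero and the loss is log 2; the loss decreases whenever the margin grows, regardless of whether the chosen response's own log-ratio is rising or falling, which is where a magnitude-aware method like BPO intervenes.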