🤖 AI Summary
DPO neglects the absolute magnitude of reward signals, which can decrease the selection probability of preferred responses and increase the risk of out-of-distribution generation, a phenomenon termed "Degraded Chosen Responses (DCR)." To address this, the authors propose Balanced Preference Optimization (BPO), a framework that dynamically balances optimization intensity between chosen and rejected responses without altering the loss structure or imposing auxiliary constraints. BPO introduces two key components, a balanced reward margin and a gap adaptor, which enable dynamic reward scaling and gradient modulation, and it requires only a single-line code modification. On mathematical reasoning benchmarks, BPO significantly outperforms DPO: accuracy improves by 10.1 points on Llama-3.1-8B-Instruct (18.8% → 28.9%) and 11.7 points on Qwen2.5-Math-7B (35.0% → 46.7%), consistently surpassing strong DPO variants including IPO, SLiC, and Cal-DPO.
📝 Abstract
Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through a pairwise ranking loss, it neglects absolute reward magnitudes. This oversight can decrease the likelihood of chosen responses and increase the risk of generating out-of-distribution responses, leading to poor performance. We term this issue Degraded Chosen Responses (DCR). To address it, we propose Balanced Preference Optimization (BPO), a novel framework that dynamically balances the optimization of chosen and rejected responses through two key components: a balanced reward margin and a gap adaptor. Unlike previous methods, BPO fundamentally resolves DPO's DCR issue without introducing additional constraints into the loss function. Experimental results on multiple mathematical reasoning tasks show that BPO significantly outperforms DPO, improving accuracy by +10.1% with Llama-3.1-8B-Instruct (18.8% to 28.9%) and +11.7% with Qwen2.5-Math-7B (35.0% to 46.7%). It also surpasses DPO variants by +3.6% over IPO (43.1%), +5.0% over SLiC (41.7%), and +3.1% over Cal-DPO (43.6%) on the same model. Remarkably, our algorithm requires only a single line of code modification, making it simple to implement and fully compatible with existing DPO-based frameworks.
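For context, the objective that BPO modifies is the standard DPO pairwise loss, in which only the *difference* between the implicit rewards of the chosen and rejected responses matters. The minimal sketch below illustrates that standard loss for a single preference pair (function and variable names are illustrative; the abstract does not give BPO's exact margin or adaptor formulas, so they are not reproduced here):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO pairwise loss for one preference pair.

    Inputs are the summed log-probabilities of the chosen/rejected
    responses under the policy (pi_*) and reference (ref_*) models.
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model.
    r_chosen = beta * (pi_chosen - ref_chosen)
    r_rejected = beta * (pi_rejected - ref_rejected)
    # Only the margin r_chosen - r_rejected enters the loss, so the absolute
    # magnitudes of the two rewards are unconstrained -- the property the
    # abstract identifies as the source of DCR (the chosen response's
    # likelihood can fall as long as the rejected one's falls faster).
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

With equal log-ratios for both responses the margin is zero and the loss is log 2; the loss decreases whenever the margin grows, regardless of whether the chosen response's own log-ratio is rising or falling, which is where a magnitude-aware method like BPO intervenes.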