🤖 AI Summary
This work addresses three key limitations of Direct Preference Optimization (DPO) in Reinforcement Learning from Human Feedback (RLHF): length bias, probability degradation, and memory inefficiency. To this end, the authors propose Length-Controlled Margin-Based Preference Optimization (LMPO), a preference optimization framework that does not require a separately parameterized reference model. Its core contributions are: (1) a length-controlled margin-based loss within the Bradley–Terry framework that regulates response length while widening the margin between preferred and rejected responses; (2) replacement of the explicit reference model with a uniform reference model that serves as an upper bound for the DPO loss, removing the dependence on a fixed, parameterized reference; and (3) average log-probability optimization to reduce the discrepancy between training and inference. Experiments on six conditional benchmarks using Mistral and LLaMA3 show that LMPO reduces probability degradation, controls response length, and outperforms DPO and other state-of-the-art preference optimization methods.
📝 Abstract
Direct Preference Optimization (DPO) is a widely adopted offline algorithm for preference-based reinforcement learning from human feedback (RLHF), designed to improve training simplicity and stability by redefining reward functions. However, DPO is hindered by several limitations, including length bias, memory inefficiency, and probability degradation. To address these challenges, we propose Length-Controlled Margin-Based Preference Optimization (LMPO), a more efficient and robust alternative. LMPO introduces a uniform reference model as an upper bound for the DPO loss, enabling a more accurate approximation of the original optimization objective. Additionally, an average log-probability optimization strategy is employed to minimize discrepancies between the training and inference phases. A key innovation of LMPO is its Length-Controlled Margin-Based loss function, integrated within the Bradley-Terry framework. This loss function regulates response length while simultaneously widening the margin between preferred and rejected outputs. In doing so, it mitigates probability degradation for both preferred and rejected responses, addressing a significant limitation of existing methods. We evaluate LMPO against state-of-the-art preference optimization techniques on two open-ended large language models, Mistral and LLaMA3, across six conditional benchmarks. Our experimental results demonstrate that LMPO effectively controls response length, reduces probability degradation, and outperforms existing approaches. The code is available at https://github.com/gengxuli/LMPO.
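To make two of the ingredients named in the abstract concrete, the following is a minimal sketch of an average (length-normalized) log-probability score combined with a margin term inside a Bradley-Terry style loss. It is not the LMPO loss itself: the abstract does not give the form of the length-controlled margin, so this sketch assumes a fixed constant margin, and the function names and the `beta` and `margin` parameters are hypothetical choices made for illustration only.

```python
import math
from typing import Sequence


def avg_logprob(token_logprobs: Sequence[float]) -> float:
    """Mean per-token log-probability of one response.

    Dividing by the number of tokens keeps the score from shrinking
    simply because a response is longer.
    """
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def margin_bt_loss(
    chosen: Sequence[float],
    rejected: Sequence[float],
    beta: float = 2.0,    # hypothetical scaling factor (assumption)
    margin: float = 0.5,  # hypothetical fixed margin (assumption)
) -> float:
    """Bradley-Terry style loss with a constant margin.

    Computes -log(sigmoid(z)) with
    z = beta * (avg_logprob(chosen) - avg_logprob(rejected)) - margin.
    The fixed margin stands in for the paper's length-controlled margin,
    whose exact form is not stated in the abstract.
    """
    z = beta * (avg_logprob(chosen) - avg_logprob(rejected)) - margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Per-token log-probabilities for a preferred and a rejected response.
    chosen = [-0.1, -0.2, -0.3]
    rejected = [-1.0, -1.2]
    print(margin_bt_loss(chosen, rejected))
```

Under these assumptions, the loss falls as the average log-probability of the preferred response rises above that of the rejected one by more than the margin, and no reference model is evaluated at any point. Repeating the same per-token values in a longer response leaves `avg_logprob` unchanged, which is the length-normalizing effect the averaging is meant to provide.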