🤖 AI Summary
Existing preference optimization methods—such as DPO and its variants—formulate preference learning as maximum likelihood estimation, oversimplifying response selection into binary classification and hindering effective integration of prior reward knowledge. To address this, we propose MaPPO, a preference optimization framework grounded in Maximum A Posteriori (MaP) estimation. MaPPO is the first method to explicitly incorporate prior reward signals into the preference objective function, generalizing DPO-based approaches without introducing additional hyperparameters or computational overhead. It maintains full compatibility with prominent variants—including SimPO, IPO, and CPO—and supports both offline and online training. Extensive evaluations on MT-Bench, AlpacaEval 2.0, and Arena-Hard demonstrate consistent alignment performance gains across diverse model sizes and families. These results empirically validate the efficacy and generalizability of prior reward guidance in preference optimization.
📝 Abstract
As the era of large language models (LLMs) acting on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. Moreover, MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin that yields consistent improvements over DPO variants, including the widely used SimPO, IPO, and CPO. Extensive empirical evaluations across different model sizes and model families on three standard benchmarks, MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
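For orientation, the contrast the abstract draws can be written out. DPO optimizes an MLE-style objective (the standard published form is shown below), while a MaP estimator augments the log-likelihood with a log-prior term; in MaPPO's case that prior is derived from reward estimates. The exact form of MaPPO's prior term is not specified in this summary, so the second line shows only the generic MaP identity.

```latex
% Standard DPO objective (MLE-style, published form):
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]

% Generic MLE vs. MaP estimation (the MaP objective adds a log-prior term):
\theta_{\mathrm{MLE}} = \arg\max_\theta\, \log p(\mathcal{D} \mid \theta),
\qquad
\theta_{\mathrm{MaP}} = \arg\max_\theta\, \big[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\big]
```

Here $y_w$ and $y_l$ denote the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ the reference policy, and $\sigma$ the logistic function; how MaPPO instantiates $\log p(\theta)$ from prior reward signals is detailed in the paper itself.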