Offline Preference Optimization via Maximum Marginal Likelihood Estimation

📅 2025-10-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the complexity and instability of Reinforcement Learning from Human Feedback (RLHF) in aligning large language models (LLMs) with human preferences, this paper proposes MMPO, an offline preference optimization framework based on Maximum Marginal Likelihood (MML) estimation. The method directly maximizes the marginal log-likelihood of preferred responses, using each preference pair as samples for the approximation, and requires neither an explicit reward model nor entropy maximization. Theoretical analysis shows that the resulting gradient up-weights chosen responses over rejected ones, so preference optimization happens implicitly. Experiments across model scales (135M to 8B parameters) show that the approach achieves competitive or superior alignment performance, is more stable than baselines with respect to the hyperparameter β, and better preserves the base model's general language capabilities.

📝 Abstract

Aligning Large Language Models (LLMs) with human preferences is crucial, but standard methods like Reinforcement Learning from Human Feedback (RLHF) are often complex and unstable. In this work, we propose a new, simpler approach that recasts alignment through the lens of Maximum Marginal Likelihood (MML) estimation. Our new MML-based Preference Optimization (MMPO) maximizes the marginal log-likelihood of a preferred text output, using the preference pair as samples for approximation, and forgoes the need for both an explicit reward model and entropy maximization. We theoretically demonstrate that MMPO implicitly performs preference optimization, producing a weighted gradient that naturally up-weights chosen responses over rejected ones. Across models ranging from 135M to 8B parameters, we empirically show that MMPO: 1) is more stable with respect to the hyperparameter $\beta$ than alternative baselines, and 2) achieves competitive or superior preference alignment while better preserving the base model's general language capabilities. Through a series of ablation experiments, we show that this improved performance is indeed attributable to MMPO's implicit preference optimization within the gradient updates.
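
The abstract does not give the MMPO objective itself, so the following is only a generic illustration of how a marginal likelihood approximated from a small set of samples can produce weighted gradients. It assumes a log-mean-exp estimator over two per-sample scores; the score values, the use of `beta` as a scale factor, and the function names `log_mean_exp` and `softmax_weights` are hypothetical and are not taken from the paper. The gradient of a log-mean-exp with respect to each score is a softmax weight, so the sample with the higher score contributes more to the update.

```python
import math

def log_mean_exp(scores):
    """Numerically stable log of the average of exp(score) over the samples."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores) / len(scores))

def softmax_weights(scores):
    """Partial derivatives of log_mean_exp with respect to each score."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values: these are NOT the paper's quantities, only placeholders
# standing in for whatever per-response score the real objective uses.
beta = 0.5
score_chosen, score_rejected = 1.2, -0.4

scores = [beta * score_chosen, beta * score_rejected]
objective = log_mean_exp(scores)
w_chosen, w_rejected = softmax_weights(scores)

print(f"objective = {objective:.4f}")
print(f"weight on chosen = {w_chosen:.4f}, weight on rejected = {w_rejected:.4f}")
```

In this toy setting the weights always sum to one, and a smaller `beta` pulls them closer to equal. Whether MMPO's actual weighting takes this form is not stated in the abstract.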

Problem

Research questions and friction points this paper is trying to address.

Aligns LLMs with human preferences using Maximum Marginal Likelihood

Eliminates need for explicit reward models in preference optimization

Maintains language capabilities while achieving competitive preference alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximum Marginal Likelihood for preference optimization

Eliminates explicit reward model requirement

Implicitly weights chosen responses over rejected

🔎 Similar Papers

Preference Elicitation for Offline Reinforcement Learning