MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of diverse and conflicting human preferences in Reinforcement Learning from Human Feedback (RLHF) by proposing Mixing Preference Optimization (MPO), an efficient and stable post-hoc alignment framework. MPO fuses multiple single-objective alignment strategies via log-linear weighting, circumventing the biases, high computational costs, and training instability inherent in reliance on a single reward model. Its key contribution is the first application of batched stochastic mirror descent to learn interpretable, retraining-free policy weights for aggregating heterogeneous preference strategies. On multi-preference benchmarks, MPO achieves balanced performance, matching or exceeding MORLHF and MaxMin-RLHF, while significantly reducing training overhead and improving convergence stability.
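The log-linear fusion described above can be sketched concretely. A minimal illustration, assuming access to each policy's next-token log-probabilities (function and variable names here are illustrative, not the paper's implementation): the mixed policy is proportional to the product of the individual policies raised to their weights, which amounts to a weighted sum of log-probabilities followed by renormalization.

```python
import numpy as np

def mix_policies_loglinear(logps, weights):
    """Log-linearly fuse per-policy next-token log-probabilities.

    logps:   array of shape (K, V) -- log-probs from K single-objective
             policies over a vocabulary of size V.
    weights: array of shape (K,) on the probability simplex (sums to 1).

    Returns the mixed distribution pi(y|x) proportional to
    prod_k pi_k(y|x)^{w_k}, i.e. the softmax of the weighted log-prob sum.
    """
    logps = np.asarray(logps, dtype=float)
    w = np.asarray(weights, dtype=float)
    mixed = w @ logps          # weighted sum of log-probs, shape (V,)
    mixed -= mixed.max()       # stabilize before exponentiation
    probs = np.exp(mixed)
    return probs / probs.sum() # renormalize to a distribution
```

With a one-hot weight vector this recovers the corresponding single-objective policy exactly, which is one reason the weights remain interpretable.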

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one, with the weight of each policy computed via batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models at significantly reduced computational cost.
Problem

Research questions and friction points this paper is trying to address.

Aligning large language models with human preferences
Reducing computational costs in reinforcement learning
Aggregating diverse single-objective policies efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixing Preference Optimization framework
Aggregates single-objective policies log-linearly
Uses batch stochastic mirror descent
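The mirror-descent piece of the list above can be sketched as well. A minimal sketch, assuming the negative-entropy mirror map (so the update is the exponentiated-gradient rule, which keeps the weights on the probability simplex); the gradient oracle and step size are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def mirror_descent_step(w, grad, lr=0.1):
    """One mirror-descent step on the probability simplex.

    Under the negative-entropy mirror map, minimizing along a (stochastic,
    e.g. batch-estimated) gradient multiplies each weight by
    exp(-lr * grad_k) and renormalizes, so w stays a valid distribution
    without any explicit projection.
    """
    w = np.asarray(w, dtype=float)
    g = np.asarray(grad, dtype=float)
    new_w = w * np.exp(-lr * g)
    return new_w / new_w.sum()
```

In a batched stochastic setting, `grad` would be estimated from a minibatch of responses scored under each candidate policy; policies with larger (worse) gradient estimates have their weights shrunk multiplicatively at each step.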