Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Models Alignment

📅 2024-10-22
🏛️ arXiv.org
📈 Citations: 3 · Influential: 0
🤖 AI Summary
Existing RLHF methods for self-play alignment suffer from two key limitations: they either guarantee only slow average-iterate convergence, which entails high storage and inference overhead, or they converge to the Nash equilibrium of a regularized game, thereby deviating from true human preferences. This paper introduces Magnetic Preference Optimization (MPO), the first method to achieve last-iterate linear convergence to the Nash equilibrium of the original preference game. Built upon Magnetic Mirror Descent (MMD), MPO integrates into standard RLHF pipelines without auxiliary reward modeling or the policy caching that average-iterate methods require, reducing memory footprint and inference cost. The paper provides rigorous convergence guarantees and demonstrates empirically that MPO consistently outperforms state-of-the-art preference optimization baselines across multiple benchmarks, validating self-play as a viable and effective pathway toward aligning models with authentic human preferences.
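
To make the setup concrete: self-play alignment seeks the Nash equilibrium of a two-player constant-sum game induced by a preference oracle, and MMD is mirror descent with an extra term pulling the iterate toward a "magnet" policy. A minimal sketch in standard mirror-descent notation (η for the step size, α for the magnet strength, q_t for the preference-based payoff, π̄ for the magnet policy; this notation is our reconstruction from the MMD literature, not copied from the paper):

    \pi^{*} = \arg\max_{\pi}\ \min_{\pi'}\ \mathbb{E}_{y \sim \pi,\, y' \sim \pi'}\!\left[\, \mathcal{P}(y \succ y' \mid x) \,\right]

    \pi_{t+1} = \arg\max_{\pi \in \Delta}\ \eta \langle \pi, q_t \rangle - \eta\alpha\, \mathrm{KL}(\pi \,\|\, \bar{\pi}) - \mathrm{KL}(\pi \,\|\, \pi_t)

which, under the entropy mirror map, has the closed form

    \pi_{t+1}(y) \propto \pi_t(y)^{\frac{1}{1+\eta\alpha}}\, \bar{\pi}(y)^{\frac{\eta\alpha}{1+\eta\alpha}}\, \exp\!\left( \frac{\eta\, q_t(y)}{1+\eta\alpha} \right)

With a fixed magnet this recursion converges linearly, but only to the equilibrium of the α-regularized game; the last-iterate guarantee for the original game comes from moving the magnet itself, as the toy sketch after the abstract illustrates.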

📝 Abstract
Self-play methods have demonstrated remarkable success in enhancing model capabilities across various domains. In the context of Reinforcement Learning from Human Feedback (RLHF), self-play not only boosts Large Language Model (LLM) performance but also overcomes the limitations of traditional Bradley-Terry (BT) model assumptions by finding the Nash equilibrium (NE) of a preference-based, two-player constant-sum game. However, existing methods either guarantee only average-iterate convergence, incurring high storage and inference costs, or converge to the NE of a regularized game, failing to accurately reflect true human preferences. In this paper, we introduce Magnetic Preference Optimization (MPO), a novel approach capable of achieving last-iterate convergence to the NE of the original game, effectively overcoming the limitations of existing methods. Building upon Magnetic Mirror Descent (MMD), MPO attains a linear convergence rate, making it particularly suitable for fine-tuning LLMs. To ensure our algorithm is both theoretically sound and practically viable, we present a simple yet effective implementation that adapts the theoretical insights to the RLHF setting. Empirical results demonstrate that MPO can significantly enhance the performance of LLMs, highlighting the potential of self-play methods in alignment.
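
As a sanity check on the mechanism (a toy sketch, not the authors' code), the snippet below runs self-play MMD on a random 3-action constant-sum preference game and periodically resets the magnet to the current iterate; the refresh schedule and step sizes are illustrative assumptions. The quantity printed is the duality gap of the last iterate, which is zero exactly at a Nash equilibrium of the original (unregularized) game:

    import numpy as np

    rng = np.random.default_rng(0)

    # Constant-sum preference matrix: P[i, j] = prob. that action i is
    # preferred to action j, so P + P.T == 1 (skew perturbation of 1/2).
    S = rng.uniform(-0.5, 0.5, size=(3, 3))
    P = 0.5 + (S - S.T) / 2.0

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def mmd_step(pi, magnet, q, eta, alpha):
        # Closed-form magnetic mirror descent step (entropy mirror map):
        # pi'(a) ∝ pi(a)^(1/(1+ηα)) * magnet(a)^(ηα/(1+ηα)) * exp(η q(a)/(1+ηα))
        c = 1.0 / (1.0 + eta * alpha)
        return softmax(c * (np.log(pi) + eta * q) + (1.0 - c) * np.log(magnet))

    def duality_gap(pi):
        # Exploitability of the symmetric profile (pi, pi); zero iff pi is a NE.
        return (P @ pi).max() - (pi @ P).min()

    pi = np.ones(3) / 3.0
    magnet = pi.copy()
    eta, alpha = 0.5, 1.0
    for t in range(2000):
        q = P @ pi                      # win-rate of each action vs. current policy
        pi = mmd_step(pi, magnet, q, eta, alpha)
        if (t + 1) % 200 == 0:          # MPO-style magnet refresh
            magnet = pi.copy()
    print("last-iterate duality gap:", duality_gap(pi))

Note that only the current policy and the magnet are ever stored; averaging-based methods would instead have to keep (or mix over) the whole iterate history, which is the storage and inference cost the abstract refers to.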
Problem

Research questions and friction points this paper is trying to address.

Achieving last-iterate convergence for language model alignment
Overcoming the limitations of traditional Bradley-Terry model assumptions
Ensuring the learned Nash equilibrium reflects true human preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Magnetic Preference Optimization (MPO) for last-iterate convergence
Linear convergence rate via Magnetic Mirror Descent (MMD)
Simple yet effective RLHF adaptation for LLM fine-tuning (see the sketch after this list)
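
The abstract describes the implementation only as "simple yet effective", so the following is a schematic of how the MMD update above might be expressed as a sequence-level RLHF loss; the function name, signature, and the REINFORCE-style surrogate are our assumptions, not the paper's code. It expects per-sequence log-probabilities under the trainable policy, the frozen previous iterate, and the frozen magnet policy, plus a preference-based payoff such as an estimated win rate over the opponent's samples:

    import torch

    def mmd_policy_loss(logp, logp_prev, logp_magnet, payoff,
                        eta=0.1, alpha=0.05):
        # Hypothetical sketch, not the authors' implementation. Mirrors
        #   max_pi  <pi, q_t> - alpha*KL(pi || magnet) - (1/eta)*KL(pi || pi_t)
        pg = -(payoff.detach() * logp)            # REINFORCE surrogate for <pi, q_t>
        kl_prev = logp - logp_prev.detach()       # single-sample KL(pi || pi_t) estimate
        kl_magnet = logp - logp_magnet.detach()   # single-sample KL(pi || magnet) estimate
        return (pg + alpha * kl_magnet + (1.0 / eta) * kl_prev).mean()

Copying the current weights into the magnet policy on a slower timescale would play the same role as the periodic magnet refresh in the tabular sketch above.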
Authors
Mingzhi Wang
Institute for Artificial Intelligence, Peking University
Chengdong Ma
Peking University
Reinforcement Learning · Multi-Agent Systems
Qizhi Chen
PhD Candidate, Zhejiang University
Multimodal Reasoning · Embodied AI · 3D Vision
Linjian Meng
National Key Laboratory for Novel Software Technology, Nanjing University
Yang Han
China Telecom
Jiancong Xiao
University of Pennsylvania
Learning Theory · Statistics · Optimization
Zhaowei Zhang
Peking University
AI Governance · AI Alignment · Game Theory · Human-AI Collaboration
Jing Huo
Nanjing University
Machine Learning · Computer Vision
Weijie J. Su
University of Pennsylvania
Yaodong Yang
Institute for Artificial Intelligence, Peking University