🤖 AI Summary
Existing RLHF methods for self-play alignment suffer from two key limitations: they either exhibit slow average-iterate convergence, which entails high computational overhead, or they converge to the Nash equilibrium of a regularized game, thereby deviating from true human preferences. This paper introduces Magnetic Preference Optimization (MPO), the first method to achieve last-iterate linear convergence to the Nash equilibrium of the *original* preference game. MPO builds on Magnetic Mirror Descent (MMD) and integrates seamlessly into standard RLHF pipelines, requiring no auxiliary reward modeling or policy caching, which significantly reduces memory footprint and inference latency. We provide rigorous theoretical guarantees of its convergence and demonstrate empirically that MPO consistently outperforms state-of-the-art preference optimization baselines across multiple benchmarks. Our results validate self-play as a viable and effective pathway toward aligning models with authentic human preferences.
📝 Abstract
Self-play methods have demonstrated remarkable success in enhancing model capabilities across various domains. In the context of Reinforcement Learning from Human Feedback (RLHF), self-play not only boosts Large Language Model (LLM) performance but also overcomes the limitations of traditional Bradley-Terry (BT) model assumptions by finding the Nash equilibrium (NE) of a preference-based, two-player constant-sum game. However, existing methods either guarantee only average-iterate convergence, incurring high storage and inference costs, or converge to the NE of a regularized game, failing to accurately reflect true human preferences. In this paper, we introduce Magnetic Preference Optimization (MPO), a novel approach capable of achieving last-iterate convergence to the NE of the original game, effectively overcoming the limitations of existing methods. Building upon Magnetic Mirror Descent (MMD), MPO attains a linear convergence rate, making it particularly suitable for fine-tuning LLMs. To ensure our algorithm is both theoretically sound and practically viable, we present a simple yet effective implementation that adapts the theoretical insights to the RLHF setting. Empirical results demonstrate that MPO can significantly enhance the performance of LLMs, highlighting the potential of self-play methods in alignment.
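To make the MMD foundation concrete, here is a minimal sketch of the closed-form magnetic mirror descent step on the probability simplex (following the standard negative-entropy form from the MMD literature), applied to a toy two-player constant-sum matrix game. This is an illustration of last-iterate convergence, not the paper's LLM implementation; the game (matching pennies), the uniform magnet policy, and the step-size values are our own choices. Note that in general MMD converges to the equilibrium of the *regularized* game; in this symmetric example the regularized and original equilibria coincide at uniform play.

```python
import numpy as np

def mmd_update(pi, grad, magnet, eta, alpha):
    # Closed-form MMD step with a negative-entropy mirror map:
    # pi' ∝ (pi * exp(-eta*grad) * magnet**(alpha*eta))**(1/(1+alpha*eta))
    logits = (np.log(pi) - eta * grad + alpha * eta * np.log(magnet)) / (1.0 + alpha * eta)
    p = np.exp(logits - logits.max())  # numerically stabilized softmax
    return p / p.sum()

# Matching pennies: a 2x2 constant-sum game whose unique Nash
# equilibrium is uniform play for both players.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x = np.array([0.9, 0.1])        # row player (maximizes x^T A y)
y = np.array([0.2, 0.8])        # column player (minimizes x^T A y)
magnet = np.array([0.5, 0.5])   # uniform magnet policy
eta, alpha = 0.05, 0.5          # illustrative step size and magnet strength

for _ in range(2000):
    gx = -(A @ y)               # row player's loss gradient
    gy = A.T @ x                # column player's loss gradient
    x, y = mmd_update(x, gx, magnet, eta, alpha), mmd_update(y, gy, magnet, eta, alpha)

print(np.round(x, 3), np.round(y, 3))  # both last iterates approach [0.5, 0.5]
```

Unlike plain simultaneous gradient play, which cycles on matching pennies, the magnet term contracts the iterates, so the final policies themselves (not just their average) approach the equilibrium, mirroring the last-iterate guarantee the paper establishes for MPO.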