🤖 AI Summary
This work addresses the challenge of aligning large language models (LLMs) with human preferences in realistic multi-turn dialogues. We observe that single-step methods such as DPO treat the interaction as a bandit problem and rely on the Bradley–Terry assumption, which fails to capture the inherent non-transitivity of human preferences. To overcome these limitations, we formulate RLHF as a two-player constant-sum Markov game, which models multi-step interactions explicitly and discards the transitivity assumption altogether. We propose Multi-step Preference Optimization (MPO), built on the natural actor-critic framework, and a theoretically strengthened variant, Optimistic MPO (OMPO), based on optimistic online gradient descent. Both algorithms provably converge to an ε-approximate Nash equilibrium, with OMPO requiring only O(ε⁻¹) policy updates. Empirical evaluation on multi-turn dialogue and mathematical reasoning tasks demonstrates significant improvements over DPO and other baselines, validating the effectiveness and practical generality of multi-step game-theoretic modeling.
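For intuition, the two-player constant-sum game can be sketched as a max–min problem over a learner policy and an opponent policy. The display below is an illustrative simplification, not the paper's exact objective (which scores preferences across all steps of the conversation):

$$(\pi^\star, \mu^\star) \in \arg\max_{\pi}\,\min_{\mu}\; \mathbb{E}_{y \sim \pi,\, y' \sim \mu}\big[\mathbb{P}(y \succ y')\big]$$

The game is constant-sum because $\mathbb{P}(y \succ y') + \mathbb{P}(y' \succ y) = 1$, and its Nash equilibrium remains well defined even when human preferences are non-transitive, precisely the case the Bradley–Terry model cannot represent.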
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the language model as a bandit problem, which limits their applicability in real-world scenarios where multi-turn conversations are common. Additionally, DPO relies on the Bradley–Terry model assumption, which does not adequately capture the non-transitive nature of human preferences. In this paper, we address these challenges by modeling the alignment problem as a two-player constant-sum Markov game, where each player seeks to maximize their winning rate against the other across all steps of the conversation. Our approach, Multi-step Preference Optimization (MPO), is built upon the natural actor-critic framework~\citep{peters2008natural}. We further develop OMPO based on the optimistic online gradient descent algorithm~\citep{rakhlin2013online,joulani17a}. Theoretically, we provide a rigorous convergence analysis for both algorithms and show that OMPO requires $\mathcal{O}(\epsilon^{-1})$ policy updates to converge to an $\epsilon$-approximate Nash equilibrium. We also validate the effectiveness of our method on a multi-turn conversation dataset and a math reasoning dataset.
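To illustrate the optimism mechanism underlying OMPO, here is the standard optimistic online gradient descent update in its Euclidean form, with the last observed gradient used as the prediction hint in the spirit of Rakhlin and Sridharan (2013); the paper's actual update acts on policy parameters and may differ in its details:

$$\hat{x}_{t+1} = \Pi_{\mathcal{X}}\big(\hat{x}_t - \eta\, g_t\big), \qquad x_{t+1} = \Pi_{\mathcal{X}}\big(\hat{x}_{t+1} - \eta\, g_t\big)$$

Here $g_t$ is the gradient observed at $x_t$, $\eta$ is the step size, and $\Pi_{\mathcal{X}}$ projects onto the feasible set. The second step plays the next iterate as if the upcoming gradient will equal the most recent one; near the equilibrium of a smooth game this prediction is accurate, which is the mechanism behind the $\mathcal{O}(\epsilon^{-1})$ rate established for OMPO.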