Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

📅 2024-06-30
🏛️ arXiv.org
📈 Citations: 10
Influential: 0

🤖 AI Summary
This work addresses a limitation of reward-based RLHF: the Bradley–Terry assumption may not capture the full complexity of human preferences. We propose a game-theoretic paradigm for LLM alignment in which preference learning is formulated as a two-player zero-sum game and the Nash equilibrium policy serves as the alignment objective. To our knowledge, this is the first application of no-regret online learning (specifically the Hedge algorithm) to RLHF, and it avoids explicit win-rate estimation. We introduce a new preference loss function with theoretical guarantees of convergence to the Nash equilibrium. The method combines policy self-play, online optimization, and fine-tuning on preference data. Evaluated on LLaMA-3-8B, it achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, substantially outperforming existing online RLHF approaches.

📝 Abstract
Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art online RLHF algorithms.
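
To make the self-play idea concrete, here is a minimal sketch of no-regret learning on a toy preference game. It is not the INPO loss from the paper: it runs plain Hedge (multiplicative weights) over three fixed responses and reads win rates directly from a known table, which is exactly the quantity INPO is designed to avoid estimating. The preference table `P`, the step count, and all function names are assumptions made for illustration.

```python
import math

# Hypothetical 3-response preference table (illustrative numbers, not from the paper).
# P[i][j] is the probability that response i is preferred over response j,
# so P[i][j] + P[j][i] == 1. The preferences are cyclic: A > B > C > A.
P = [
    [0.5, 0.9, 0.4],
    [0.1, 0.5, 0.7],
    [0.6, 0.3, 0.5],
]


def win_rates(policy):
    """Expected win rate of each response against an opponent sampling from `policy`."""
    return [sum(p_ij * q for p_ij, q in zip(row, policy)) for row in P]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hedge_self_play(steps):
    """Run Hedge against the policy's own current iterate; return the average policy."""
    n = len(P)
    eta = math.sqrt(8.0 * math.log(n) / steps)  # standard Hedge step size for gains in [0, 1]
    cumulative = [0.0] * n  # running sum of per-response win rates
    average = [0.0] * n
    for _ in range(steps):
        policy = softmax([eta * g for g in cumulative])
        for i, prob in enumerate(policy):
            average[i] += prob / steps
        gains = win_rates(policy)  # the opponent is the same policy (self-play)
        cumulative = [c + g for c, g in zip(cumulative, gains)]
    return average


def exploitability(policy):
    """How far a best response can push its win rate above 0.5 against `policy`."""
    return max(win_rates(policy)) - 0.5


avg_policy = hedge_self_play(5000)
gap = exploitability(avg_policy)
uniform_gap = exploitability([1.0 / 3.0] * 3)
```

In this sketch the time-averaged policy is the approximate Nash policy: because any policy wins exactly half the time against itself, the Hedge regret bound caps how much a best response can beat the average. A policy with zero exploitability cannot be beaten more than half the time by any fixed response, which is the alignment target the abstract describes.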

Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with general human preferences
Overcoming limitations of reward-based RLHF approaches
Reducing computational and annotation costs in RLHF
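
One way to see the limitation of the reward-based approach is a short worked argument that cyclic preferences cannot be expressed under the Bradley-Terry model. The symbols below (a reward $r$, the logistic function $\sigma$, and three responses $A$, $B$, $C$) are notation introduced here for illustration, with the prompt omitted for brevity.

```latex
% Bradley-Terry: preference probability is a logistic function of a reward gap.
\[
  P(y \succ y') = \sigma\bigl(r(y) - r(y')\bigr)
               = \frac{e^{r(y)}}{e^{r(y)} + e^{r(y')}}
\]
% Since \sigma(z) > 1/2 exactly when z > 0:
\[
  P(y \succ y') > \tfrac{1}{2} \iff r(y) > r(y')
\]
% A cyclic preference A > B > C > A would therefore require
\[
  r(A) > r(B) > r(C) > r(A),
\]
% which is impossible for any real-valued reward r.
```

A general preference model places no such transitivity constraint on $P(y \succ y')$, which is why the problem is instead posed as a two-player game.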

Innovation

Methods, ideas, or system contributions that make the work stand out.

Game-theoretic approach for RLHF alignment
Iterative Nash policy optimization algorithm
Direct loss minimization on preference datasets