Scalable Reinforcement Post-Training Beyond Static Human Prompts: Evolving Alignment via Asymmetric Self-Play

📅 2024-10-31
📈 Citations: 5
Influential: 0
🤖 AI Summary
Existing LLM reinforcement learning (RL) post-training methods are constrained by static prompt distributions and lack mechanisms for dynamic prompt evolution. This paper proposes EVA, an asymmetric self-play framework that casts post-training as an infinite two-player game between a *Creator*, which dynamically generates information-rich prompts, and a *Solver*, which produces preference-aligned responses. EVA is the first method to enable adaptive prompt generation in both offline and online RL settings for language models. It introduces regret-based game signals to drive prompt evolution and plugs into existing optimizers, including Direct Preference Optimization (DPO) and REINFORCE Leave-One-Out (RLOO). Evaluated on Gemma-2-9B-IT, EVA reaches a 60.1% win rate on Arena-Hard with DPO (+8.5 points absolute) and 62.4% with RLOO (+9.8 points), surpassing Claude-3-Opus and approaching Gemini-1.5-Pro, without requiring any human-crafted prompts.
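The Creator–Solver game described above can be read as a simple alternating loop: the solver trains on preference pairs from the current prompt set, then the creator evolves that set before the next update. The sketch below is one illustrative reading, not the authors' implementation; `generate`, `reward_fn`, `train_solver`, and `evolve` are hypothetical stand-ins for the policy sampler, the reward model, the preference-optimization step (e.g. DPO), and the prompt-evolution step.

```python
def eva_loop(prompts, generate, reward_fn, train_solver, evolve, n_iters=3):
    """Alternating creator/solver loop (illustrative sketch, not the paper's code)."""
    for _ in range(n_iters):
        # Solver step: sample responses per prompt, rank them with the
        # reward model, and update the policy on (chosen, rejected) pairs.
        batch = []
        for p in prompts:
            responses = [generate(p) for _ in range(4)]
            ranked = sorted(responses, key=reward_fn, reverse=True)
            batch.append((p, ranked[0], ranked[-1]))
        train_solver(batch)
        # Creator step: evolve the prompt set toward more informative
        # prompts before the next solver update.
        prompts = evolve(prompts, reward_fn)
    return prompts
```

Because the prompt set grows between solver updates, each iteration trains on a curriculum the previous iteration helped shape, which is the self-play aspect of the method.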

📝 Abstract
Current reinforcement learning (RL) frameworks for large language models (LLM) post-training typically assume a fixed prompt distribution, which is sub-optimal and bottlenecks scalability. Prior works have explored prompt evolving, but are often limited to the supervised fine-tuning stage, and prompts are sampled and evolved uniformly without signals. This empirical work presents a paradigm shift: Evolving Alignment via Asymmetric Self-Play (eva), that casts post-training as an infinite game with regret-based signals for 2 players: (i) a creator, who strategically samples and creates new informative prompts and (ii) a solver, who learns to produce preferred responses. eva is the first method that allows language models to adaptively create training prompts in both offline and online RL post-training. The design is simple, easy-to-use yet remarkably effective: eva sets a new SOTA on challenging benchmarks, without any extra human prompts, e.g. it boosts the win-rate of gemma-2-9b-it on Arena-Hard by 51.6% → 60.1% for DPO and 52.6% → 62.4% for RLOO, surpassing claude-3-opus and catching up to gemini-1.5-pro, both of which are orders of magnitude larger. Extensive experiments show eva can create effective RL curricula and is robust across ablations. We believe adaptively evolving prompts are key to designing the next-generation RL post-training scheme.
Problem

Research questions and friction points this paper is trying to address.

Overcoming fixed prompt limitations in RL post-training for LLMs
Enabling adaptive prompt creation in offline and online RL
Improving model alignment via asymmetric self-play with regret signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric self-play for adaptive prompt creation
Regret-based signals for strategic prompt sampling
Offline and online RL post-training integration
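The regret-based sampling idea can be illustrated with a minimal sketch: score each prompt by how much room for improvement the solver still has on it, then sample the next training batch in proportion to that score. Note that the proxy below (reward gap between the best and worst sampled responses per prompt) and all function names are simplifying assumptions for illustration, not the paper's exact estimator.

```python
import random

def regret_proxy(rewards):
    # Gap between best and worst sampled-response rewards: a crude proxy
    # for how much the solver can still improve on this prompt.
    return max(rewards) - min(rewards)

def select_prompts(prompt_rewards, k, rng=None):
    """Sample k training prompts with probability proportional to regret.

    prompt_rewards: dict mapping prompt -> list of response rewards.
    """
    rng = rng or random.Random(0)
    prompts = list(prompt_rewards)
    scores = [regret_proxy(r) for r in prompt_rewards.values()]
    total = sum(scores)
    if total == 0:
        # All gaps are zero: nothing is informative, fall back to uniform.
        return [rng.choice(prompts) for _ in range(k)]
    return rng.choices(prompts, weights=[s / total for s in scores], k=k)
```

Prompts whose responses all score the same (too easy or too hard to discriminate) receive zero weight, so training mass concentrates on prompts where preference signal actually exists.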
Ziyu Ye
Google DeepMind, The University of Chicago
Rishabh Agarwal
Meta, ex DeepMind, Google Brain
Reinforcement Learning · Deep Learning · Artificial Intelligence
Tianqi Liu
Google DeepMind
Rishabh Joshi
Google DeepMind, ex Brain Team
Language Technologies
S. Velury
Google DeepMind
Quoc Le
Google DeepMind
Qijun Tan
Google DeepMind
Yuan Liu
Google DeepMind