🤖 AI Summary
Existing LLM reinforcement learning (RL) post-training methods are constrained by static prompt distributions and lack mechanisms for dynamic prompt evolution. This paper proposes EVA, an asymmetric self-play framework that casts post-training as an infinite two-player game between a *Creator*, which strategically generates informative prompts, and a *Solver*, which learns to produce preferred responses. EVA is the first method to enable adaptive prompt generation in both offline and online RL post-training for language models. It uses regret-based game signals to drive prompt evolution and plugs into standard preference-optimization algorithms, including Direct Preference Optimization (DPO) and REINFORCE Leave-One-Out (RLOO). Evaluated on Gemma-2-9B-IT, EVA lifts the Arena-Hard win rate from 51.6% to 60.1% with DPO (+8.5 points absolute) and from 52.6% to 62.4% with RLOO (+9.8 points), surpassing Claude-3-Opus and approaching Gemini-1.5-Pro, without requiring any extra human-crafted prompts.
📝 Abstract
Current reinforcement learning (RL) frameworks for large language model (LLM) post-training typically assume a fixed prompt distribution, which is sub-optimal and bottlenecks scalability. Prior work has explored prompt evolution, but is often limited to the supervised fine-tuning stage, where prompts are sampled and evolved uniformly without signals. This empirical work presents a paradigm shift: Evolving Alignment via Asymmetric Self-Play (eva), which casts post-training as an infinite game with regret-based signals between two players: (i) a creator, who strategically samples and creates new informative prompts, and (ii) a solver, who learns to produce preferred responses. eva is the first method that allows language models to adaptively create training prompts in both offline and online RL post-training. The design is simple, easy to use, yet remarkably effective: eva sets a new SOTA on challenging benchmarks without any extra human prompts, e.g. it boosts the win rate of gemma-2-9b-it on Arena-Hard from 51.6% to 60.1% for DPO and from 52.6% to 62.4% for RLOO, surpassing claude-3-opus and catching up to gemini-1.5-pro, both of which are orders of magnitude larger. Extensive experiments show eva can create effective RL curricula and is robust across ablations. We believe adaptively evolving prompts is key to designing the next-generation RL post-training scheme.
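The creator's role described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes a regret proxy of "best minus mean reward" over sampled responses per prompt (the paper's exact regret signal may differ), and the names `estimate_regret`, `creator_step`, and `reward_fn` are hypothetical.

```python
def estimate_regret(rewards):
    """Proxy for solver regret on a prompt: gap between the best and the
    average reward across sampled responses. A large gap suggests the
    solver can already do well but often fails, i.e. the prompt is
    informative for training (assumed proxy, not eva's exact signal)."""
    return max(rewards) - sum(rewards) / len(rewards)

def creator_step(prompt_pool, reward_fn, n_samples=4, n_select=2):
    """One creator iteration: score each prompt by estimated regret over
    n_samples solver rollouts, then keep the n_select most informative
    prompts as seeds for the next round of solver training."""
    scored = []
    for prompt in prompt_pool:
        rewards = [reward_fn(prompt) for _ in range(n_samples)]
        scored.append((estimate_regret(rewards), prompt))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [prompt for _, prompt in scored[:n_select]]
```

In this view, a prompt the solver always answers well (or always fails) carries little signal, while a prompt with high reward variance marks the frontier of the solver's ability; the selected prompts would then be mutated or expanded by the creator before the solver's next update.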