🤖 AI Summary
This work addresses the limitations of existing large language model alignment methods, which rely on static absolute reward functions and are thus vulnerable to data scarcity, noise, and training instability. The authors propose a novel multi-agent alignment framework based on dynamic competition, eschewing the conventional Bradley-Terry model in favor of direct policy learning from pairwise win/loss signals. By integrating an Elo rating system for adaptive opponent selection, the approach enables curriculum learning and temperature-controlled sampling. Empirical evaluations on AlpacaEval 2.0 and MT-Bench demonstrate substantial performance gains over baseline methods, with a 4.5× improvement in robustness to noisy data, thereby validating the efficacy of dynamic opponent selection and purely win/loss-driven training.
📝 Abstract
Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability. We introduce Elo-Evolve, a co-evolutionary framework that redefines alignment as dynamic multi-agent competition within an adaptive opponent pool. Our approach makes two key innovations: (1) eliminating Bradley-Terry model dependencies by learning directly from binary win/loss outcomes in pairwise competitions, and (2) implementing Elo-orchestrated opponent selection that provides automatic curriculum learning through temperature-controlled sampling. We ground our approach in PAC learning theory, demonstrating that pairwise comparison achieves superior sample complexity, and empirically validate a 4.5× noise reduction compared to absolute scoring approaches. Experimentally, we train a Qwen2.5-7B model using our framework with opponents including Qwen2.5-14B, Qwen2.5-32B, and Qwen3-8B. Results demonstrate a clear performance hierarchy (point-based methods < static pairwise training < Elo-Evolve) on AlpacaEval 2.0 and MT-Bench, validating the progressive benefits of pairwise comparison and dynamic opponent selection for LLM alignment.
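To make the Elo-orchestrated selection concrete, here is a minimal sketch of the two ingredients the abstract names: a standard Elo rating update from a binary win/loss outcome, and temperature-controlled opponent sampling over the pool. The function names, the K-factor of 32, and the choice of weighting by rating distance are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def elo_update(r_winner, r_loser, k=32.0):
    """Standard Elo update after a single pairwise win/loss outcome.

    k is the illustrative K-factor (step size); the paper may use a
    different value or schedule.
    """
    # Expected score of the winner under the Elo logistic model.
    expected_win = 1.0 / (1.0 + 10.0 ** ((r_loser - r_winner) / 400.0))
    r_winner_new = r_winner + k * (1.0 - expected_win)
    r_loser_new = r_loser - k * (1.0 - expected_win)
    return r_winner_new, r_loser_new


def sample_opponent(learner_rating, pool, temperature=100.0, rng=random):
    """Sample (name, rating) from the pool, favoring near-peer opponents.

    Weight ∝ exp(-|r_opp - r_learner| / temperature): a low temperature
    concentrates play on opponents near the learner's current rating
    (the curriculum effect), while a high temperature flattens the
    distribution toward uniform exploration of the pool.
    """
    weights = [math.exp(-abs(rating - learner_rating) / temperature)
               for _, rating in pool]
    x = rng.random() * sum(weights)
    for (name, rating), w in zip(pool, weights):
        x -= w
        if x <= 0.0:
            return name, rating
    return pool[-1]  # numerical fallback
```

As the learner's rating rises after wins, the sampling distribution automatically shifts toward stronger opponents, which is one plausible reading of the "automatic curriculum learning" the abstract describes.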