🤖 AI Summary
This work applies AlphaEvolve, an evolutionary coding agent powered by large language models, to automatically discover novel multi-agent reinforcement learning algorithms, addressing the inefficiency of manual trial-and-error in navigating the vast algorithmic design space. The framework systematically explores variants of Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO), uncovering non-intuitive mechanisms such as volatility-adaptive discounting, hybrid meta-solvers, and dynamic annealing schedules. The search yields two new algorithms, VAD-CFR and SHOR-PSRO, which outperform state-of-the-art baselines such as Discounted Predictive CFR+ and standard static meta-solvers across multiple game-theoretic benchmarks, achieving both faster convergence and stronger final performance.
📝 Abstract
Much of the progress in Multi-Agent Reinforcement Learning (MARL) for imperfect-information games has historically depended on the manual, iterative refinement of baselines. While foundational families like Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO) rest on solid theoretical ground, the design of their most effective variants often relies on human intuition to navigate a vast algorithmic design space. In this work, we propose the use of AlphaEvolve, an evolutionary coding agent powered by large language models, to automatically discover new multi-agent learning algorithms. We demonstrate the generality of this framework by evolving novel variants for two distinct paradigms of game-theoretic learning. First, in the domain of iterative regret minimization, we evolve the logic governing regret accumulation and policy derivation, discovering a new algorithm, Volatility-Adaptive Discounted CFR (VAD-CFR). VAD-CFR employs novel, non-intuitive mechanisms, including volatility-sensitive discounting, consistency-enforced optimism, and a hard warm-start policy accumulation schedule, to outperform state-of-the-art baselines such as Discounted Predictive CFR+. Second, in the regime of population-based training algorithms, we evolve training-time and evaluation-time meta-strategy solvers for PSRO, discovering a new variant, Smoothed Hybrid Optimistic Regret PSRO (SHOR-PSRO). SHOR-PSRO introduces a hybrid meta-solver that linearly blends Optimistic Regret Matching with a smoothed, temperature-controlled distribution over the best pure strategies. By dynamically annealing this blending factor and diversity bonuses during training, the algorithm automates the transition from population diversity to rigorous equilibrium finding, yielding superior empirical convergence compared to standard static meta-solvers.
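To make the SHOR-style meta-solver idea concrete, the sketch below blends an optimistic regret-matching meta-strategy with a temperature-controlled softmax over the best pure strategies and linearly anneals the blend toward pure regret matching. This is a minimal illustration, not the paper's implementation: the function names (`optimistic_regret_matching`, `smoothed_best_pure`, `shor_meta_strategy`), the self-play treatment of the zero-sum meta-game, the linear annealing schedule, and all default hyperparameters are assumptions introduced here for illustration.

```python
# Hypothetical sketch of a hybrid, annealed meta-solver in the spirit of SHOR-PSRO.
# Assumes a two-player zero-sum meta-game given by a square payoff matrix for the
# row player; the actual SHOR-PSRO solver may differ substantially.
import numpy as np


def optimistic_regret_matching(payoff_matrix, iterations=1000):
    """Approximate an equilibrium meta-strategy via optimistic regret matching
    on the meta-game, using self-play against the current meta-strategy."""
    n = payoff_matrix.shape[0]
    regrets = np.zeros(n)
    avg_strategy = np.zeros(n)
    prev_utils = np.zeros(n)
    strategy = np.ones(n) / n
    for _ in range(iterations):
        utils = payoff_matrix @ strategy
        # Optimistic (predictive) update: extrapolate from the last utility vector.
        optimistic = 2.0 * utils - prev_utils
        regrets += optimistic - optimistic @ strategy
        prev_utils = utils
        positive = np.maximum(regrets, 0.0)
        total = positive.sum()
        strategy = positive / total if total > 0 else np.ones(n) / n
        avg_strategy += strategy
    return avg_strategy / avg_strategy.sum()


def smoothed_best_pure(payoff_matrix, opponent_meta, temperature):
    """Temperature-controlled softmax over each pure strategy's expected payoff,
    i.e. a smoothed distribution concentrated on the best pure strategies."""
    expected = payoff_matrix @ opponent_meta
    logits = expected / max(temperature, 1e-8)
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def shor_meta_strategy(payoff_matrix, epoch, total_epochs,
                       lam_start=0.5, temp_start=1.0):
    """Linearly blend optimistic regret matching with the smoothed best-pure
    distribution, annealing the blend and temperature toward pure regret
    matching as training progresses (assumed linear schedule)."""
    progress = epoch / max(total_epochs, 1)
    lam = lam_start * (1.0 - progress)
    temperature = temp_start * (1.0 - progress) + 1e-3
    orm = optimistic_regret_matching(payoff_matrix)
    smoothed = smoothed_best_pure(payoff_matrix, orm, temperature)
    return (1.0 - lam) * orm + lam * smoothed
```

Under this reading, early PSRO epochs give more weight to the smoothed best-pure component (encouraging population diversity), while later epochs collapse onto the regret-matching solution (equilibrium finding); for example, `shor_meta_strategy(payoff_matrix, epoch=0, total_epochs=20)` would return the most heavily blended distribution, and `epoch=20` would return essentially the pure regret-matching meta-strategy.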