🤖 AI Summary
To address the scalability limitations of traditional Evolution Strategies (ES) for large neural networks, which stem from the prohibitive computational and memory cost of full-rank perturbations, this paper proposes EGGROLL, an efficient, backpropagation-free ES variant based on low-rank learning. Methodologically, EGGROLL replaces each full-rank parameter perturbation with a randomized low-rank perturbation AB^T, reducing per-layer auxiliary storage from mn to r(m+n) and the per-member forward-pass cost from O(mn) to O(r(m+n)); a theoretical analysis shows the low-rank update converges to the full-rank update at a fast O(1/r) rate, where r is the rank. Because the overall update is averaged across a large population of workers, it remains high-rank despite the per-member savings. As a result, EGGROLL enables the first stable pretraining of nonlinear recurrent language models that operate purely in integer datatypes. Experiments demonstrate that EGGROLL matches full-rank ES in tabula-rasa reinforcement learning while being faster, is competitive with GRPO as a technique for improving LLM reasoning, and handles noisy, non-differentiable objectives at billion-parameter scale.
📝 Abstract
We introduce Evolution Guided General Optimization via Low-rank Learning (EGGROLL), an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes for modern large neural network architectures with billions of parameters. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives with excellent scaling potential through parallelisation. Naïve ES becomes prohibitively expensive at scale due to the computational and memory costs associated with generating matrix perturbations $E \in \mathbb{R}^{m \times n}$ and the batched matrix multiplications needed to compute per-member forward passes. EGGROLL overcomes these bottlenecks by generating random matrices $A \in \mathbb{R}^{m \times r}, B \in \mathbb{R}^{n \times r}$ with $r \ll \min(m,n)$ to form a low-rank matrix perturbation $AB^\top$ that is used in place of the full-rank perturbation $E$. As the overall update is an average across a population of $N$ workers, this still results in a high-rank update but with significant memory and computation savings, reducing the auxiliary storage from $mn$ to $r(m+n)$ per layer and the cost of a forward pass from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m+n))$ when compared to full-rank ES. A theoretical analysis reveals our low-rank update converges to the full-rank update at a fast $\mathcal{O}\left(\frac{1}{r}\right)$ rate. Our experiments show that (1) EGGROLL does not compromise the performance of ES in tabula-rasa RL settings while being faster, (2) it is competitive with GRPO as a technique for improving LLM reasoning, and (3) EGGROLL enables stable pre-training of nonlinear recurrent language models that operate purely in integer datatypes.
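To make the low-rank perturbation idea concrete, the following is a minimal NumPy sketch of one ES step for a single weight matrix, assuming a standard score-weighted ES gradient estimator; the function name, hyperparameters, and normalization are illustrative and not taken from the paper's implementation. Each population member perturbs $W$ with $\sigma A_i B_i^\top$ (storing only $r(m+n)$ numbers) instead of a full $m \times n$ Gaussian, while the averaged update over the population is generally high-rank.

```python
import numpy as np

def eggroll_step(W, fitness_fn, pop_size=32, rank=4, sigma=0.1, lr=0.2, rng=None):
    """One low-rank ES step (illustrative sketch, not the authors' exact code).

    Each worker i perturbs W with sigma * A_i @ B_i.T, a rank-`rank` matrix,
    so per-worker auxiliary storage is r*(m+n) rather than m*n. The update
    averaged over the population is a sum of pop_size rank-r terms, hence
    high-rank for large populations.
    """
    rng = rng or np.random.default_rng(0)
    m, n = W.shape
    As = rng.standard_normal((pop_size, m, rank))   # low-rank factors A_i
    Bs = rng.standard_normal((pop_size, n, rank))   # low-rank factors B_i
    # Evaluate each perturbed candidate (forward passes only, no backprop).
    f = np.array([fitness_fn(W + sigma * As[i] @ Bs[i].T) for i in range(pop_size)])
    f = (f - f.mean()) / (f.std() + 1e-8)           # normalize returns
    # Score-weighted average of perturbation directions: sum_i f_i A_i B_i^T.
    grad_est = np.einsum('i,imr,inr->mn', f, As, Bs) / (pop_size * sigma)
    return W + lr * grad_est
```

Note that the candidate perturbation `As[i] @ Bs[i].T` never needs to be materialized in a real implementation: a layer's forward pass `x @ (W + s*A@B.T).T` can be computed as `x @ W.T + s*(x @ B) @ A.T`, which is where the $\mathcal{O}(r(m+n))$ per-member cost comes from.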