🤖 AI Summary
To address the scalability limitations of traditional Evolution Strategies (ES) for large neural networks, which stem from the prohibitive computational and memory cost of full-rank perturbations, this paper proposes EGGROLL, an efficient, backpropagation-free ES variant based on low-rank learning. Methodologically, EGGROLL replaces each full-rank parameter perturbation with a randomized low-rank perturbation AB^T, reducing per-layer auxiliary storage from mn to r(m+n) and the per-member forward-pass cost from O(mn) to O(r(m+n)); a theoretical analysis shows the low-rank update converges to the full-rank update at a fast O(1/r) rate, where r is the rank. Because the overall update is averaged across a large population of workers, it remains high-rank despite the per-member savings. As a result, EGGROLL enables the first stable pretraining of nonlinear recurrent language models that operate purely in integer datatypes. Experiments demonstrate that EGGROLL matches full-rank ES in tabula-rasa reinforcement learning while being faster, is competitive with GRPO as a technique for improving LLM reasoning, and handles noisy, non-differentiable objectives at billion-parameter scale.
📝 Abstract
We introduce Evolution Guided General Optimization via Low-rank Learning (EGGROLL), an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes for modern large neural network architectures with billions of parameters. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives with excellent scaling potential through parallelisation. Naïve ES becomes prohibitively expensive at scale due to the computational and memory costs associated with generating matrix perturbations $E \in \mathbb{R}^{m \times n}$ and the batched matrix multiplications needed to compute per-member forward passes. EGGROLL overcomes these bottlenecks by generating random matrices $A \in \mathbb{R}^{m \times r}, B \in \mathbb{R}^{n \times r}$ with $r \ll \min(m,n)$ to form a low-rank matrix perturbation $AB^\top$ that is used in place of the full-rank perturbation $E$. As the overall update is an average across a population of $N$ workers, this still results in a high-rank update but with significant memory and computation savings, reducing the auxiliary storage from $mn$ to $r(m+n)$ per layer and the cost of a forward pass from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m+n))$ when compared to full-rank ES. A theoretical analysis reveals our low-rank update converges to the full-rank update at a fast $\mathcal{O}\left(\frac{1}{r}\right)$ rate. Our experiments show that (1) EGGROLL does not compromise the performance of ES in tabula-rasa RL settings while being faster, (2) it is competitive with GRPO as a technique for improving LLM reasoning, and (3) EGGROLL enables stable pre-training of nonlinear recurrent language models that operate purely in integer datatypes.
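To make the low-rank perturbation idea concrete, the following is a minimal NumPy sketch of one ES step for a single weight matrix, assuming a standard score-weighted ES gradient estimator; the function name, hyperparameters, and normalization are illustrative and not taken from the paper's implementation. Each population member perturbs $W$ with $\sigma A_i B_i^\top$ (storing only $r(m+n)$ numbers) instead of a full $m \times n$ Gaussian, while the averaged update over the population is generally high-rank.

```python
import numpy as np

def eggroll_step(W, fitness_fn, pop_size=32, rank=4, sigma=0.1, lr=0.2, rng=None):
    """One low-rank ES step (illustrative sketch, not the authors' exact code).

    Each worker i perturbs W with sigma * A_i @ B_i.T, a rank-`rank` matrix,
    so per-worker auxiliary storage is r*(m+n) rather than m*n. The update
    averaged over the population is a sum of pop_size rank-r terms, hence
    high-rank for large populations.
    """
    rng = rng or np.random.default_rng(0)
    m, n = W.shape
    As = rng.standard_normal((pop_size, m, rank))   # low-rank factors A_i
    Bs = rng.standard_normal((pop_size, n, rank))   # low-rank factors B_i
    # Evaluate each perturbed candidate (forward passes only, no backprop).
    f = np.array([fitness_fn(W + sigma * As[i] @ Bs[i].T) for i in range(pop_size)])
    f = (f - f.mean()) / (f.std() + 1e-8)           # normalize returns
    # Score-weighted average of perturbation directions: sum_i f_i A_i B_i^T.
    grad_est = np.einsum('i,imr,inr->mn', f, As, Bs) / (pop_size * sigma)
    return W + lr * grad_est
```

Note that the candidate perturbation `As[i] @ Bs[i].T` never needs to be materialized in a real implementation: a layer's forward pass `x @ (W + s*A@B.T).T` can be computed as `x @ W.T + s*(x @ B) @ A.T`, which is where the $\mathcal{O}(r(m+n))$ per-member cost comes from.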