Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low sample efficiency, training instability, and reward hacking prevalent in reinforcement learning (RL)-based fine-tuning of large language models (LLMs), this paper proposes the first distributed evolution strategies (ES) framework designed for full-parameter fine-tuning of billion-scale LLMs. Departing from gradient backpropagation and value modeling, the method estimates gradient directions via noise perturbation and leverages massively parallel evaluation for efficient search over the parameter space. Experiments demonstrate that the approach significantly outperforms mainstream RL methods such as PPO in both sample efficiency and training stability, achieving superior performance across multiple tasks. It also exhibits stronger cross-model robustness, better adaptability to long-horizon rewards, and greater training consistency. By eliminating reliance on RL-specific components (e.g., critics, advantage estimation), the framework establishes a scalable, RL-free paradigm for LLM alignment.
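The loop the summary describes (perturb parameters with noise, evaluate the perturbed copies in parallel, aggregate rewards into an update direction) can be sketched as below. This is a minimal OpenAI-ES-style sketch on a toy objective; the hyperparameters, antithetic sampling, and rank-based fitness shaping are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def es_step(theta, reward_fn, rng, pop_size=30, sigma=0.02, lr=0.01):
    """One Evolution Strategies update: estimate an ascent direction
    from reward evaluations of noise-perturbed parameter copies."""
    half = pop_size // 2
    eps = rng.standard_normal((half, theta.size))
    eps = np.concatenate([eps, -eps])                 # antithetic pairs
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    # Rank-based fitness shaping: robust to the scale of raw rewards.
    ranks = rewards.argsort().argsort()
    shaped = ranks / (len(ranks) - 1) - 0.5
    grad = (shaped @ eps) / (len(eps) * sigma)        # gradient estimate
    return theta + lr * grad                          # gradient *ascent*

# Toy check: maximize reward -||theta - 1||^2 from theta = 0.
rng = np.random.default_rng(42)
theta = np.zeros(5)
for _ in range(300):
    theta = es_step(theta, lambda p: -np.sum((p - 1.0) ** 2), rng)
```

In a distributed setting, each `reward_fn` evaluation (here a toy quadratic, in the paper a full LLM rollout) runs on a separate worker, which is what makes the search embarrassingly parallel.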

📝 Abstract
Fine-tuning pre-trained large language models (LLMs) for downstream tasks is a critical step in the AI deployment pipeline. Reinforcement learning (RL) is arguably the most prominent fine-tuning method, contributing to the birth of many state-of-the-art LLMs. In contrast, evolution strategies (ES), which once showed comparable performance to RL on models with a few million parameters, were neglected due to the pessimistic perception of their scalability to larger models. In this work, we report the first successful attempt to scale up ES for fine-tuning the full parameters of LLMs, showing the surprising fact that ES can search efficiently over billions of parameters and outperform existing RL fine-tuning methods in multiple respects, including sample efficiency, tolerance to long-horizon rewards, robustness to different base LLMs, reduced reward hacking, and more stable performance across runs. It therefore serves as a basis to unlock a new direction in LLM fine-tuning beyond what current RL techniques provide. The source code is available at: https://github.com/VsonicV/es-fine-tuning-paper.
Problem

Research questions and friction points this paper is trying to address.

Scaling evolution strategies for billion-parameter LLM fine-tuning
Overcoming scalability limitations of ES compared to reinforcement learning
Providing an efficient, stable alternative to RL for LLM optimization
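A key obstacle behind the scaling problem above is communication: a billion-parameter perturbation vector cannot be shipped between workers. One standard trick from prior distributed ES work is to broadcast only an integer seed and regenerate the noise locally; whether this paper uses exactly this scheme is an assumption here, and the sketch below only illustrates the idea.

```python
import numpy as np

def perturb_from_seed(theta, seed, sigma=0.02):
    """Regenerate a Gaussian perturbation from a shared integer seed, so
    only (seed, reward) pairs cross the network, never full parameter-sized
    noise vectors."""
    rng = np.random.default_rng(seed)
    return theta + sigma * rng.standard_normal(theta.size)

# Two workers given the same seed reconstruct the identical perturbation.
theta = np.zeros(4)
a = perturb_from_seed(theta, seed=123)
b = perturb_from_seed(theta, seed=123)
assert np.allclose(a, b)
```

The coordinator then rebuilds each worker's noise from its seed when aggregating the update, keeping per-worker network traffic constant regardless of model size.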
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaled evolution strategies for billion-parameter LLM fine-tuning
Outperformed reinforcement learning in efficiency and stability
Enabled robust, reward-hacking-resistant LLM optimization beyond RL