ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of excessive GPU memory consumption during reinforcement learning fine-tuning of large language models, which hinders their deployment in resource-constrained settings. The authors propose ESSAM, a novel framework that, for the first time, integrates Evolution Strategies—a zeroth-order optimization method—with Sharpness-Aware Maximization (SAM) to enable full-parameter fine-tuning. This approach simultaneously reduces memory footprint and enhances model generalization. Evaluated on the GSM8K mathematical reasoning benchmark, ESSAM achieves an accuracy of 78.27%, outperforming PPO (77.72%) and approaching the performance of GRPO (78.34%), while reducing GPU memory usage by 18× compared to PPO and by 10× relative to GRPO.
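The memory savings come from the zeroth-order nature of Evolution Strategies: the model is only ever run forward on perturbed copies of the parameters, so no activations or optimizer state for backpropagation need to be stored. The paper's implementation details are not given here; the following is a minimal illustrative sketch of a generic antithetic-sampling ES step (function names and hyperparameters are assumptions, not the authors' code):

```python
import numpy as np

def es_step(params, reward_fn, sigma=0.02, lr=0.01, pop_size=8, rng=None):
    """One generic zeroth-order Evolution Strategies update (antithetic sampling).

    Only forward evaluations of reward_fn are needed -- no backpropagation,
    which is the source of the memory savings described above.
    This is an illustrative sketch, not ESSAM itself.
    """
    rng = rng or np.random.default_rng(0)
    grad_est = np.zeros_like(params)
    for _ in range(pop_size):
        eps = rng.standard_normal(params.shape)
        # Antithetic pair: evaluate reward at +eps and -eps perturbations.
        r_plus = reward_fn(params + sigma * eps)
        r_minus = reward_fn(params - sigma * eps)
        grad_est += (r_plus - r_minus) * eps
    grad_est /= 2.0 * sigma * pop_size  # finite-difference gradient estimate
    return params + lr * grad_est       # ascend the estimated reward gradient
```

In an LLM setting, `reward_fn` would score sampled completions (e.g. GSM8K answer correctness), and the perturbations would be applied to the full parameter vector.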

📝 Abstract
Reinforcement learning (RL) has become a key training step for improving mathematical reasoning in large language models (LLMs), but it often incurs high GPU memory usage, which makes it hard to apply in resource-constrained settings. To mitigate these issues, we propose Evolution Strategies with Sharpness-Aware Maximization (ESSAM), a full-parameter fine-tuning framework that tightly combines the zeroth-order parameter-space search of Evolution Strategies (ES) with Sharpness-Aware Maximization (SAM) to improve generalization. We conduct fine-tuning experiments on the mainstream mathematical reasoning benchmark GSM8K. The results show that ESSAM achieves an average accuracy of 78.27\% across all models, and its overall performance is comparable to RL methods: it surpasses the classic RL algorithm PPO (77.72\%), is comparable to GRPO (78.34\%), and even surpasses both on some models. In terms of GPU memory, ESSAM reduces average usage by $18\times$ compared to PPO and by $10\times$ compared to GRPO, achieving an extremely low GPU memory footprint.
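The sharpness-aware component seeks parameters in flat regions of the objective landscape by first perturbing toward the locally worst point in a small ball, then updating with the gradient (or, in ESSAM's case, the zeroth-order estimate) taken at that perturbed point. As a minimal sketch of this ascent-then-descent idea in its standard gradient-based form (the `rho` and `lr` values and function names are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def sam_update(params, loss_fn, grad_fn, rho=0.05, lr=0.1):
    """Generic sharpness-aware update step (illustrative sketch).

    1. Move to the approximate worst point within a rho-ball (ascent).
    2. Apply that point's gradient back at the original parameters (descent).
    Minimizing the perturbed loss biases the search toward flat minima,
    which is associated with better generalization.
    """
    g = grad_fn(params)
    # Normalized ascent direction scaled to the rho-ball boundary.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_sharp = grad_fn(params + eps)  # gradient at the perturbed point
    return params - lr * g_sharp
```

In ESSAM this inner gradient would itself be replaced by an ES-style zeroth-order estimate, keeping the whole procedure backpropagation-free.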
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Large Language Models
GPU Memory Efficiency
Mathematical Reasoning
Fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolution Strategies
Sharpness-Aware Maximization
Memory-Efficient Fine-Tuning
Reinforcement Learning
Large Language Models