Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high parallel-inference latency, substantial memory overhead, poor reliability, and prohibitive cost of reinforcement learning (RL) post-training for large language models (LMs), this paper introduces SAPO, a fully decentralized, asynchronous RL algorithm. SAPO scales LM RL fine-tuning through distributed rollout experience sharing and node-local policy updates, supporting both collaborative and standalone operation across heterogeneous hardware and model configurations, which reduces system coupling and infrastructure dependency. In controlled experiments, SAPO achieves cumulative reward gains of up to 94% over baselines, and an open-source demo on a network with thousands of nodes contributed by Gensyn community members shows it running robustly across diverse hardware platforms and model architectures.

📝 Abstract
Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g., latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required, and nodes can operate in silos if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper we show SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members running the algorithm on diverse hardware and models during an open-source demo.
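The abstract's description reduces to a simple per-node loop: generate rollouts locally, publish them to the swarm, sample rollouts shared by others, and update the local policy on the mixed batch. Below is a minimal, hypothetical Python sketch of that loop; `SwarmBuffer`, `DummyPolicy`, `sapo_step`, and the `local_k`/`swarm_k` split are illustrative assumptions, not the paper's actual API, reward function, or update rule.

```python
import random


class SwarmBuffer:
    """Stand-in for the network-wide rollout pool (assumption for illustration).

    In a real decentralized deployment this would be an asynchronous
    broadcast/gossip layer rather than an in-process list.
    """

    def __init__(self):
        self._pool = []  # entries: dicts with "prompt", "completion", "reward"

    def publish(self, rollouts):
        self._pool.extend(rollouts)

    def sample(self, k):
        return random.sample(self._pool, min(k, len(self._pool)))


class DummyPolicy:
    """Toy policy so the sketch runs end to end; a real node would hold an LM."""

    def generate(self, prompt):
        return prompt[::-1]  # pretend this is a decoded completion

    def reward(self, prompt, completion):
        return float(len(completion) % 2)  # arbitrary toy verifier score

    def update(self, batch):
        pass  # a real node would run an RL policy-gradient update here


def sapo_step(policy, swarm, prompts, local_k=4, swarm_k=4):
    """One asynchronous training step on a single node: generate local
    rollouts, share them, sample others' rollouts, update on the mix."""
    mine = []
    for p in prompts:
        for _ in range(local_k):
            completion = policy.generate(p)
            mine.append({"prompt": p, "completion": completion,
                         "reward": policy.reward(p, completion)})
    swarm.publish(mine)               # share experience with the network
    borrowed = swarm.sample(swarm_k)  # reuse rollouts shared by other nodes
    policy.update(mine + borrowed)    # train on local + borrowed experience
    return mine + borrowed


if __name__ == "__main__":
    swarm = SwarmBuffer()
    node = DummyPolicy()
    batch = sapo_step(node, swarm, prompts=["2+2=?", "factor 91"])
    print(f"trained on {len(batch)} rollouts")
```

The design point the abstract emphasizes is that only rollouts cross node boundaries, never weights or gradients, which is why nodes are free to differ in hardware, latency, and model architecture.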
Problem

Research questions and friction points this paper is trying to address.

Efficiently scaling RL post-training for language models
Mitigating latency, memory, and reliability challenges in large-scale parallel RL inference
Enabling collective experience sharing across heterogeneous nodes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decentralized asynchronous RL algorithm for LM post-training
Shares rollouts across heterogeneous compute node networks
Drops assumptions about latency, model homogeneity, and hardware
Jeffrey Amico
Gensyn AI Team
Gabriel Passamani Andrade
Gensyn AI Team
J
John Donaghy
Gensyn AI Team
Ben Fielding
Co-Founder of Gensyn
Deep Learning, Evolutionary Optimisation, Privacy Preserving Machine Learning, Computer Vision
Tristin Forbus
Gensyn AI Team
Harry Grieve
Gensyn AI Team
Semih Kara
Gensyn AI Team
Jari Kolehmainen
Gensyn AI Team
Yihua Lou
Gensyn AI Team
Christopher Nies
Gensyn AI Team
Edward Phillip Flores Nuño
Gensyn AI Team
Diogo Ortega
Gensyn AI Team
Shikhar Rastogi
Gensyn AI Team
Austin Virts
Gensyn AI Team
Matthew J. Wright
Gensyn AI Team