🤖 AI Summary
In distributed reinforcement learning, policy weight synchronization often becomes a scalability bottleneck due to high communication overhead, particularly in bandwidth-constrained or decentralized settings. This work empirically demonstrates for the first time that RL weight updates exhibit substantial sparsity, typically touching fewer than 1% of parameters, at both step-level and multi-step granularities. Building on this insight, the authors propose PULSE, a lossless, fault-tolerant sparse synchronization mechanism that avoids floating-point drift. By tracking per-element weight differences exactly and transmitting only index-value pairs under a lossless sparse encoding, PULSE reduces communication volume by over two orders of magnitude, from 14 GB to roughly 108 MB, while fully preserving training dynamics and performance. Consequently, the required synchronization bandwidth drops from 20 Gbit/s to just 0.2 Gbit/s.
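The core idea can be sketched in a few lines: diff the trainer's new weights against the worker's current copy at the bit level, ship only the changed positions as index-value pairs, and apply them by direct assignment. This is a minimal illustration assuming float32 weights in NumPy; the function names `encode_patch` and `apply_patch` are hypothetical, not the authors' API.

```python
import numpy as np

def encode_patch(old: np.ndarray, new: np.ndarray):
    """Encode a weight update as (indices, values) of changed elements.

    Comparing raw bit patterns (not a numeric tolerance) keeps the patch
    lossless: every element whose representation changed is captured.
    """
    changed = old.view(np.uint32) != new.view(np.uint32)
    idx = np.flatnonzero(changed)
    return idx, new[idx]

def apply_patch(weights: np.ndarray, idx: np.ndarray, vals: np.ndarray):
    """Overwrite changed elements by assignment (not addition), so the
    worker's copy becomes bit-identical to the trainer's weights."""
    weights[idx] = vals
    return weights
```

If only k of N parameters change per sync, the patch costs roughly k x (4 + 4) bytes (a 32-bit index plus a 32-bit value) instead of 4N bytes for a dense transfer, which is where the >100x reduction at <1% update density comes from.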
📝 Abstract
Reinforcement learning (RL) is a critical component for post-training large language models (LLMs). However, in bandwidth-constrained distributed RL, scalability is often bottlenecked by the synchronization of policy weights from trainers to inference workers, particularly over commodity networks or in decentralized settings. While recent studies suggest that RL updates modify only a small fraction of model parameters, these observations are typically based on coarse checkpoint differences. We present a systematic empirical study of weight-update sparsity at both step-level and multi-step granularities, examining its evolution across training dynamics, off-policy delay, and model scale. We find that update sparsity is consistently high, frequently exceeding 99% across practically relevant settings. Leveraging this structure, we propose PULSE (Patch Updates via Lossless Sparse Encoding), a simple yet highly efficient lossless weight synchronization method that transmits only the indices and values of modified parameters. PULSE is robust to transmission errors and avoids the floating-point drift inherent in additive delta schemes. In bandwidth-constrained decentralized environments, our approach achieves over 100x (14 GB to ~108 MB) communication reduction while maintaining bit-identical training dynamics and performance compared to full weight synchronization. As a result, PULSE enables decentralized RL training to approach centralized throughput, cutting the bandwidth needed to sustain high GPU utilization during weight synchronization from 20 Gbit/s to 0.2 Gbit/s.
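The abstract's contrast between assignment-based patches and additive delta schemes comes down to floating-point rounding: a delta computed and added in finite precision need not reproduce the trainer's value exactly, and such errors compound over many synchronization steps. The toy example below uses deliberately extreme float32 values to make the rounding visible in a single step; real weight updates are far smaller, but the same mechanism drives drift over thousands of syncs.

```python
import numpy as np

# Extreme values chosen purely to expose rounding in one step.
old = np.float32(1e8)  # worker's stale copy of one parameter
new = np.float32(1.0)  # trainer's value after an update

# Additive delta scheme: transmit delta = new - old, worker adds it.
delta = np.float32(new - old)      # 1.0 - 1e8 rounds to -1e8 in float32
drifted = np.float32(old + delta)  # worker lands on 0.0, not 1.0

# Assignment scheme (as PULSE transmits values): overwrite directly.
patched = new  # bit-identical to the trainer's parameter
```

Because PULSE sends the new values themselves rather than differences to be accumulated, applying a patch is exact regardless of magnitudes, which is what makes bit-identical training dynamics possible.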