🤖 AI Summary
This work tackles the high cost and low efficiency of distributed rollouts during reinforcement learning post-training of large language models, problems that stem from wide-area coordination difficulties and policy distribution delays. The authors propose an architecture that integrates centralized learning with distributed rollouts, overlapping policy generation, distribution, and training while treating policy staleness as a controllable parameter to enable efficient and cost-effective training. Key innovations include a capacity model based on overlapping phases to guide resource allocation, a peer-assisted pipelined broadcast mechanism to alleviate distribution bottlenecks, and a heterogeneity-aware, cost-conscious activation strategy to improve resource utilization. Experiments on 4B and 8B models using GRPO post-training demonstrate significant improvements in cost efficiency under real-world wide-area network conditions, while maintaining reward performance comparable to that of strong baselines.
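The benefit of the peer-assisted pipelined broadcast mentioned above can be seen with a simple timing model. The sketch below is illustrative only: the chain topology, chunk count, and all numbers are assumptions for exposition, not details from the paper.

```python
# Illustrative timing model: naive unicast vs. peer-assisted pipelined
# broadcast of policy weights. All parameters are assumed for illustration.

def naive_unicast_time(size_gb: float, uplink_gbps: float, n_workers: int) -> float:
    """Learner sends the full policy to each worker in turn over its own uplink."""
    return n_workers * size_gb * 8 / uplink_gbps  # seconds

def pipelined_chain_time(size_gb: float, link_gbps: float, n_workers: int,
                         n_chunks: int) -> float:
    """Policy is split into chunks and relayed worker-to-worker in a chain.

    The first chunk needs n_workers hops to reach the last worker; after
    that, a new chunk arrives every chunk-transfer time (pipeline steady state).
    """
    chunk_time = (size_gb / n_chunks) * 8 / link_gbps  # seconds per chunk per hop
    return (n_workers + n_chunks - 1) * chunk_time

# Example: 16 GB of weights, 1 Gbps links, 8 workers, 64 chunks.
naive = naive_unicast_time(16, 1.0, 8)        # 1024 s: 8 full copies over one uplink
piped = pipelined_chain_time(16, 1.0, 8, 64)  # 142 s: roughly one copy plus pipeline fill
```

With many chunks, total time approaches a single copy's transfer time regardless of worker count, which is why pipelining over peers relieves the learner's uplink bottleneck.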
📝 Abstract
Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction among rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while achieving RL reward comparable to that of strong baselines.
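The overlap-based capacity model is not specified in this abstract, but one plausible form of the provisioning rule it yields can be sketched as follows. The symbols here (training step time, dissemination latency, per-worker rollout rate, batch demand, staleness bound) are assumptions for illustration, not the paper's notation.

```python
import math

# Back-of-the-envelope provisioning sketch for an overlapped rollout pipeline.
# Assumed model: the learner takes train_time_s seconds per step, each step
# consumes batch_rollouts rollouts, each worker generates rollouts_per_sec,
# and staleness of up to max_staleness policy versions is tolerated.

def min_workers(batch_rollouts: int, rollouts_per_sec: float,
                train_time_s: float) -> int:
    """Workers needed so that rollout supply per step matches demand
    when generation fully overlaps training."""
    return math.ceil(batch_rollouts / (rollouts_per_sec * train_time_s))

def staleness_feasible(dissemination_s: float, train_time_s: float,
                       max_staleness: int) -> bool:
    """Dissemination latency can hide behind training as long as it fits
    inside the window that the staleness bound allows."""
    return dissemination_s <= max_staleness * train_time_s

# Example: 512 rollouts per step, 0.5 rollouts/s per worker, 60 s steps.
workers = min_workers(batch_rollouts=512, rollouts_per_sec=0.5, train_time_s=60)
# 90 s dissemination is absorbed if up to 2 stale policy versions are allowed.
ok = staleness_feasible(dissemination_s=90, train_time_s=60, max_staleness=2)
```

The point of such a rule is that, once dissemination fits inside the staleness window, the worker count is set purely by rollout throughput versus training demand, keeping the learner fully utilized.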