🤖 AI Summary
This work tackles the high cost and low efficiency of distributed rollouts during reinforcement learning post-training of large language models, problems that stem from wide-area coordination difficulties and policy distribution delays. The authors propose an architecture that integrates centralized learning with distributed rollouts, overlapping policy generation, distribution, and training while treating policy staleness as a controllable parameter to enable efficient and cost-effective training. Key innovations include a capacity model based on overlapping phases to guide resource allocation, a peer-assisted pipelined broadcast mechanism to alleviate distribution bottlenecks, and a heterogeneity-aware, cost-conscious activation strategy to improve resource utilization. Experiments on 4B and 8B models using GRPO post-training demonstrate significant improvements in cost efficiency under real-world wide-area network conditions, while maintaining reward performance comparable to that of strong baselines.
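The benefit of the peer-assisted pipelined broadcast mentioned above can be seen with a simple timing model. The sketch below is illustrative only: the chain topology, chunk count, and all numbers are assumptions for exposition, not details from the paper.

```python
# Illustrative timing model: naive unicast vs. peer-assisted pipelined
# broadcast of policy weights. All parameters are assumed for illustration.

def naive_unicast_time(size_gb: float, uplink_gbps: float, n_workers: int) -> float:
    """Learner sends the full policy to each worker in turn over its own uplink."""
    return n_workers * size_gb * 8 / uplink_gbps  # seconds

def pipelined_chain_time(size_gb: float, link_gbps: float, n_workers: int,
                         n_chunks: int) -> float:
    """Policy is split into chunks and relayed worker-to-worker in a chain.

    The first chunk needs n_workers hops to reach the last worker; after
    that, a new chunk arrives every chunk-transfer time (pipeline steady state).
    """
    chunk_time = (size_gb / n_chunks) * 8 / link_gbps  # seconds per chunk per hop
    return (n_workers + n_chunks - 1) * chunk_time

# Example: 16 GB of weights, 1 Gbps links, 8 workers, 64 chunks.
naive = naive_unicast_time(16, 1.0, 8)        # 1024 s: 8 full copies over one uplink
piped = pipelined_chain_time(16, 1.0, 8, 64)  # 142 s: roughly one copy plus pipeline fill
```

With many chunks, total time approaches a single copy's transfer time regardless of worker count, which is why pipelining over peers relieves the learner's uplink bottleneck.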
📝 Abstract
Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction among rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while achieving RL reward comparable to that of strong baselines.
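The overlap-based capacity model is not specified in this abstract, but one plausible form of the provisioning rule it yields can be sketched as follows. The symbols here (training step time, dissemination latency, per-worker rollout rate, batch demand, staleness bound) are assumptions for illustration, not the paper's notation.

```python
import math

# Back-of-the-envelope provisioning sketch for an overlapped rollout pipeline.
# Assumed model: the learner takes train_time_s seconds per step, each step
# consumes batch_rollouts rollouts, each worker generates rollouts_per_sec,
# and staleness of up to max_staleness policy versions is tolerated.

def min_workers(batch_rollouts: int, rollouts_per_sec: float,
                train_time_s: float) -> int:
    """Workers needed so that rollout supply per step matches demand
    when generation fully overlaps training."""
    return math.ceil(batch_rollouts / (rollouts_per_sec * train_time_s))

def staleness_feasible(dissemination_s: float, train_time_s: float,
                       max_staleness: int) -> bool:
    """Dissemination latency can hide behind training as long as it fits
    inside the window that the staleness bound allows."""
    return dissemination_s <= max_staleness * train_time_s

# Example: 512 rollouts per step, 0.5 rollouts/s per worker, 60 s steps.
workers = min_workers(batch_rollouts=512, rollouts_per_sec=0.5, train_time_s=60)
# 90 s dissemination is absorbed if up to 2 stale policy versions are allowed.
ok = staleness_feasible(dissemination_s=90, train_time_s=60, max_staleness=2)
```

The point of such a rule is that, once dissemination fits inside the staleness window, the worker count is set purely by rollout throughput versus training demand, keeping the learner fully utilized.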