Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inefficiency of serial, co-located inference and training on the same cluster during RL-based post-training of LLMs, which violates the SPMD assumption and underutilizes resources, this paper proposes a fully decoupled heterogeneous distributed architecture: trajectory sampling is offloaded to edge devices, while policy optimization remains centralized in the training cluster. Two lightweight synchronization protocols are introduced: a sequential pull mode that refreshes sampler weights on every call, and an asynchronous push-pull mode that streams version-tagged rollouts through a replay buffer, trading low-bias weight updates against high-throughput rollout distribution. The architecture natively supports hybrid edge–datacenter deployment. Empirical evaluation on Qwen-family models demonstrates convergence speed and final reward comparable to fully co-located systems, establishing for the first time the feasibility and efficiency of leveraging edge devices for large-scale rollout generation in LLM reinforcement learning.

📝 Abstract
Modern RL-based post-training for large language models (LLMs) co-locates trajectory sampling and policy optimisation on the same GPU cluster, forcing the system to switch between inference and training workloads. This serial context switching violates the single-program-multiple-data (SPMD) assumption underlying today's distributed training systems. We present Echo, an RL system that cleanly decouples these two phases across heterogeneous "inference" and "training" swarms while preserving statistical efficiency. Echo introduces two lightweight synchronisation protocols: a sequential pull mode that refreshes sampler weights on every API call for minimal bias, and an asynchronous push-pull mode that streams version-tagged rollouts through a replay buffer to maximise hardware utilisation. Training three representative RL workloads with Qwen3-4B, Qwen2.5-7B and Qwen3-32B on a geographically distributed cluster, Echo matches a fully co-located Verl baseline in convergence speed and final reward while off-loading trajectory generation to commodity edge hardware. These promising results demonstrate that large-scale RL for LLMs could achieve datacentre-grade performance using decentralised, heterogeneous resources.
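The two synchronisation modes described in the abstract can be sketched as follows. This is a minimal toy model, not Echo's actual API: the `Trainer`, `SequentialPullSampler`, and `PushPullSampler` classes and their method names are illustrative assumptions; rollouts are plain dicts standing in for LLM generations.

```python
import threading
from collections import deque

class Trainer:
    """Centralized optimizer holding the authoritative, versioned weights."""
    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.weights = {"step": 0}

    def latest_weights(self):
        with self._lock:
            return dict(self.weights), self.version

    def update(self):
        # One policy-optimization step bumps the weight version.
        with self._lock:
            self.version += 1
            self.weights["step"] = self.version

class SequentialPullSampler:
    """Pull mode: refresh weights before every generation call, so each
    rollout is produced by the newest policy (minimal bias, lower throughput)."""
    def __init__(self, trainer):
        self.trainer = trainer

    def generate(self, prompt):
        weights, version = self.trainer.latest_weights()  # pull on every call
        return {"prompt": prompt, "version": version}     # stand-in rollout

class PushPullSampler:
    """Push-pull mode: stream version-tagged rollouts into a shared replay
    buffer; weights are synced periodically rather than per call
    (maximal hardware utilisation, bounded staleness)."""
    def __init__(self, trainer, buffer):
        self.trainer = trainer
        self.buffer = buffer
        self.weights, self.version = trainer.latest_weights()

    def sync(self):
        self.weights, self.version = self.trainer.latest_weights()

    def generate(self, prompt):
        self.buffer.append({"prompt": prompt, "version": self.version})

# Illustrative run: the pull sampler sees new weights immediately,
# the push-pull sampler keeps tagging rollouts with its old version until sync.
trainer = Trainer()
buffer = deque()
pull = SequentialPullSampler(trainer)
push = PushPullSampler(trainer, buffer)

trainer.update()          # trainer advances to version 1
r = pull.generate("q1")   # pulled fresh: tagged v1
push.generate("q2")       # still tagged v0
push.sync()
push.generate("q3")       # now tagged v1
```

The bias/throughput trade-off is visible in the version tags: pull-mode rollouts always carry the latest version, while push-pull rollouts may lag until the next sync.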
Problem

Research questions and friction points this paper is trying to address.

Serial context switching between co-located inference and training underutilizes the GPU cluster
Co-location violates the SPMD assumption of today's distributed training systems
Rollout generation cannot exploit decentralized, heterogeneous edge hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples trajectory sampling and policy optimisation across heterogeneous swarms
Uses sequential pull and asynchronous push-pull protocols
Achieves datacentre-grade performance with decentralised resources
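Version-tagging rollouts implies a staleness policy on the trainer side when draining the replay buffer. A minimal sketch of one plausible policy; the `drain_fresh` helper and `max_lag` threshold are hypothetical, not taken from the paper:

```python
from collections import deque

def drain_fresh(buffer, current_version, max_lag=1):
    """Drain a replay buffer of version-tagged rollouts, keeping only those
    whose policy version lags the trainer's by at most `max_lag` steps.
    Returns (fresh_rollouts, number_dropped_as_stale)."""
    fresh, stale = [], 0
    while buffer:
        rollout = buffer.popleft()
        if current_version - rollout["version"] <= max_lag:
            fresh.append(rollout)
        else:
            stale += 1
    return fresh, stale

# Example: trainer is at version 5; a rollout tagged v3 is too stale to use.
buf = deque([
    {"id": 1, "version": 5},
    {"id": 2, "version": 3},
    {"id": 3, "version": 4},
])
fresh, dropped = drain_fresh(buf, current_version=5, max_lag=1)
```

A tighter `max_lag` pushes the system toward the sequential-pull end of the spectrum (lower bias, more discarded work); a looser one toward maximal edge-hardware utilisation.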