DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the rollout efficiency bottleneck in reinforcement learning (RL) training of large language models, where the rollout phase, dominated by long-tailed trajectories, accounts for 50–80% of total step time. The authors propose DORA (Dynamic ORchestration for Asynchronous Rollout), which co-designs algorithmic and system-level components around multi-version streaming rollout, an asynchronous paradigm that keeps multiple policy versions active concurrently. This approach overlaps generation with training while preserving intra-trajectory policy consistency, data integrity, and bounded staleness. By integrating concurrent multi-version policy management, asynchronous pipeline scheduling, and system-level optimizations, the method scales to complex architectures such as Mixture-of-Experts and eliminates training bubbles without compromising algorithmic correctness. Experiments show up to 2–3× higher throughput than state-of-the-art systems on open-source benchmarks and 2–4× acceleration over synchronous training in industrial-scale runs on tens of thousands of accelerators, with the resulting LongCat-Flash-Thinking models achieving competitive performance on complex reasoning benchmarks.

📝 Abstract

Reinforcement learning (RL) has become a critical paradigm for LLM post-training, yet the rollout phase, which accounts for 50–80% of total step time, is bottlenecked by skewed generation: long-tailed trajectories, which are indispensable for model performance, block the entire training pipeline. Asynchronous training offers a natural remedy by overlapping generation with training, but introduces a fundamental tension between efficiency and algorithmic correctness. We identify three constraints that asynchronous training must satisfy to preserve convergence: intra-trajectory policy consistency, data integrity, and bounded staleness. Existing approaches either fail to intrinsically address the long-tailed trajectory problem, which is further exacerbated by the load imbalance characteristic of Mixture-of-Experts models, or deviate from the standard RL training formulation, thereby hindering model convergence. Therefore, we propose DORA (Dynamic ORchestration for Asynchronous Rollout), which addresses this challenge through algorithm-system co-design. DORA introduces multi-version streaming rollout, a novel asynchronous paradigm that maintains multiple policy versions concurrently, eliminating pipeline bubbles entirely without violating the algorithmic constraints. Experimental results demonstrate that DORA achieves substantial improvements in throughput, up to 2–3 times higher than state-of-the-art systems on open-source benchmarks, without compromising convergence. Furthermore, in large-scale industrial applications with tens of thousands of accelerators, DORA accelerates RL training by 2–4 times compared to synchronous training across various scenarios. The resulting open-source LongCat-Flash-Thinking models exhibit competitive performance on complex reasoning benchmarks, matching the capability of most advanced LLMs.
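
Two of the three constraints named in the abstract, intra-trajectory policy consistency and bounded staleness, can be illustrated with a small admission check on completed trajectories. The sketch below is only an illustration under stated assumptions, not DORA's implementation: the `Trajectory` type, the function names, and the `max_staleness` parameter are all hypothetical, staleness is assumed to mean the gap between the trainer's policy version and the version that generated the trajectory, and data integrity is left out because the abstract does not define it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    """Hypothetical record of one completed rollout."""
    sample_id: int
    token_versions: List[int]  # policy version that generated each token


def is_policy_consistent(traj: Trajectory) -> bool:
    # Intra-trajectory policy consistency: every token of the trajectory
    # was generated by the same policy version. An empty trajectory fails.
    return len(set(traj.token_versions)) == 1


def staleness(traj: Trajectory, trainer_version: int) -> int:
    # Assumed definition: how many versions the trainer is ahead of
    # the oldest policy version that contributed to this trajectory.
    return trainer_version - min(traj.token_versions)


def select_for_training(
    completed: List[Trajectory], trainer_version: int, max_staleness: int
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Split completed trajectories into (accepted, rejected)."""
    accepted, rejected = [], []
    for traj in completed:
        # The consistency check runs first, so staleness() is never
        # called on an empty trajectory.
        ok = (
            is_policy_consistent(traj)
            and 0 <= staleness(traj, trainer_version) <= max_staleness
        )
        (accepted if ok else rejected).append(traj)
    return accepted, rejected


completed = [
    Trajectory(0, [3, 3, 3]),  # current version, consistent
    Trajectory(1, [2, 3]),     # weights changed mid-trajectory
    Trajectory(2, [1, 1]),     # consistent but two versions old
    Trajectory(3, []),         # empty rollout
]
accepted, rejected = select_for_training(
    completed, trainer_version=3, max_staleness=1
)
```

With `trainer_version=3` and `max_staleness=1`, only the first trajectory is admitted. Returning the rejected trajectories instead of discarding them silently leaves the caller free to decide how to handle them.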

Problem

Research questions and friction points this paper is trying to address.

asynchronous reinforcement learning

long-tailed trajectories

rollout bottleneck

algorithmic correctness

Mixture-of-Experts

Innovation

Methods, ideas, or system contributions that make the work stand out.

asynchronous reinforcement learning

multi-version streaming rollout

algorithm-system co-design