SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of the rollout phase in reinforcement learning training for large language models, where generating long trajectories can account for up to 70% of total training time and severely constrains efforts to scale RL for stronger reasoning capabilities. To mitigate this bottleneck, the authors propose an online length-aware scheduling strategy that dynamically reorders samples by output length, prioritizes shorter sequences, and constructs a near on-policy micro-curriculum. The key innovations include the first successful co-optimization of large-scale batched rollouts and micro-curriculum learning, along with a stateful controller and caching mechanism that controllably regulate the degree of off-policyness. Experiments on LLaMA-3.1-8B and Qwen-2.5-32B demonstrate over a 50% reduction in training bubbles and performance gains of 3.9% to 18.4% under identical data budgets.
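To make the scheduling idea concrete, below is a minimal Python sketch of one way such a length-aware scheduler could work: finished rollouts sit in a min-heap keyed by output length, and the shortest ones are released first in fixed-size groups for early policy updates. The class and method names (`LengthAwareScheduler`, `pop_group`, `group_size`) are illustrative assumptions, not the paper's actual implementation.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Rollout:
    """A finished rollout, ordered by output length (shortest first)."""
    length: int                          # generated output length in tokens (sort key)
    sample: dict = field(compare=False)  # trajectory payload, excluded from ordering


class LengthAwareScheduler:
    """Sketch (not the paper's code): keep completed rollouts in a
    min-heap keyed by output length and emit the shortest ones in
    groups, so early update batches are built from short samples."""

    def __init__(self, group_size: int):
        self.group_size = group_size
        self._heap: list[Rollout] = []

    def add(self, sample: dict, output_length: int) -> None:
        """Register a completed rollout as soon as generation finishes."""
        heapq.heappush(self._heap, Rollout(output_length, sample))

    def pop_group(self) -> list[dict] | None:
        """Return the next update group of shortest finished rollouts,
        or None if too few samples have completed so far."""
        if len(self._heap) < self.group_size:
            return None
        return [heapq.heappop(self._heap).sample for _ in range(self.group_size)]
```

Because short rollouts finish generating first anyway, a scheduler like this can hand them to the trainer while longer trajectories are still decoding, which is where the reduction in rollout "bubbles" comes from.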

📝 Abstract
Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy designed to address this bottleneck by improving rollout efficiency while maintaining training stability. SortedRL reorders rollout samples based on output lengths, prioritizing short samples and forming them into groups for early updates. This simultaneously enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction. To further accelerate the pipeline, SortedRL controls the degree of off-policy training through a cache-based mechanism, and is supported by a dedicated RL infrastructure that manages rollouts and updates via a stateful controller and rollout buffer. Experiments using LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logical puzzles and math benchmarks such as AIME 24, MATH 500, and Minerva Math, show that SortedRL reduces RL training bubble ratios by over 50% while attaining 3.9% to 18.4% higher performance than the baselines given the same amount of data.
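The cache-based off-policyness control could be sketched as a rollout buffer that tags each cached sample with the policy version that produced it and evicts samples that fall too many versions behind the current policy. The names and the staleness rule below (`RolloutBuffer`, `max_staleness`) are assumptions for illustration, not the paper's exact design.

```python
from collections import deque


class RolloutBuffer:
    """Sketch (assumed design): bound off-policyness by remembering
    which policy version generated each cached rollout and evicting
    samples that have drifted more than `max_staleness` versions
    behind the current policy."""

    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness
        self._cache: deque[tuple[int, dict]] = deque()  # (policy_version, sample)

    def put(self, sample: dict, policy_version: int) -> None:
        """Cache a rollout together with the policy version that produced it."""
        self._cache.append((policy_version, sample))

    def get_fresh(self, current_version: int, n: int) -> list[dict]:
        """Evict stale samples, then return up to n near on-policy ones."""
        while self._cache and current_version - self._cache[0][0] > self.max_staleness:
            self._cache.popleft()  # drop samples that drifted too far off-policy
        return [self._cache.popleft()[1] for _ in range(min(n, len(self._cache)))]
```

In this reading, setting `max_staleness` to 0 recovers strictly on-policy updates, while larger values trade some policy freshness for fewer pipeline stalls, which matches the abstract's framing of "controlling the degree of off-policy training."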
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
large language models
rollout efficiency
training bottleneck
long chain-of-thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

SortedRL
length-aware scheduling
rollout efficiency
on-policy micro-curriculum
cache-based off-policy control