🤖 AI Summary
To address the long-tail distribution of response lengths in reinforcement learning (RL) training for large language models (LLMs), where rare, lengthy responses severely degrade training throughput and waste GPU resources, this paper proposes a lossless acceleration system. The system introduces three key innovations: (1) lightweight, continuous training of an adaptive drafter model on idle GPUs, enabling zero-overhead model alignment; (2) adaptive CUDA Graph pre-capturing coupled with a memory-efficient graph pool to improve the stability and throughput of speculative decoding; and (3) an adaptive rollout engine that dynamically matches draft quality to sequence length. Evaluated end-to-end, the system achieves over 1.7× training speedup with zero accuracy degradation, while concurrently producing a high-quality, production-ready drafter model suitable for direct deployment.
📝 Abstract
The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically via Reinforcement Learning (RL), encounters a critical efficiency bottleneck: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding (SD). Applying SD in RL is challenging due to the dynamic workloads, the evolving target model, and the overhead of draft model training. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDA Graphs and adaptively selects a suitable SD strategy for each input batch. Evaluations demonstrate that TLT achieves over 1.7× end-to-end RL training speedup over state-of-the-art systems, preserves model accuracy, and yields a high-quality draft model as a free byproduct, suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
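To make the Adaptive Rollout Engine idea concrete, the sketch below shows one plausible shape for it: a pool of pre-captured graphs keyed by (batch size, speculation length), plus a per-batch policy that speculates more aggressively as sequences enter the long tail (where batch sizes shrink and SD pays off most). All names, thresholds, and the placeholder "graph" objects are illustrative assumptions, not taken from the TLT codebase.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class GraphPool:
    """Memory-efficient cache of pre-captured graphs, keyed by
    (batch_size, speculation_length). Hypothetical sketch only."""
    captured: Dict[Tuple[int, int], str] = field(default_factory=dict)

    def get(self, batch_size: int, spec_len: int) -> str:
        key = (batch_size, spec_len)
        if key not in self.captured:
            # A real engine would capture a CUDA Graph here once and
            # replay it on later calls; we record a placeholder string.
            self.captured[key] = f"graph[bs={batch_size},k={spec_len}]"
        return self.captured[key]

def choose_spec_len(seq_len: int) -> int:
    """Illustrative policy: short sequences run with plain decoding
    (SD overhead not worth it); long-tail sequences speculate more."""
    if seq_len < 1024:
        return 0  # no speculation
    elif seq_len < 8192:
        return 2  # moderate speculation
    return 4      # long-tail: aggressive speculation

# Per-batch dispatch: pick a strategy, then reuse a captured graph.
pool = GraphPool()
for seq_len, batch_size in [(512, 32), (4096, 8), (20000, 1)]:
    k = choose_spec_len(seq_len)
    graph = pool.get(batch_size, k)
```

The key design point this illustrates is that capturing graphs ahead of time for a small set of (batch size, speculation length) shapes lets the engine switch SD strategies per batch without re-capture stalls, while the shared pool bounds memory use.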