Beat the long tail: Distribution-Aware Speculative Decoding for RL Training

📅 2025-11-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In reinforcement learning (RL) post-training, the rollout phase—generating long trajectories token-by-token—incurs substantial computational overhead, with a small fraction of long sequences dominating overall latency. Method: We propose a distribution-aware speculative decoding framework: (i) an incremental suffix-tree-based drafter, built from historical rollout trajectories, enabling non-parametric, prompt-level pattern modeling; and (ii) a length-aware draft budget allocation strategy that prioritizes acceleration of longer sequences while maintaining high acceptance rates. Contribution/Results: Our method preserves the original model’s output distribution and exactly reproduces the baseline RL training curve. Experiments on mathematical and code reasoning tasks demonstrate up to 50% reduction in rollout time, significantly improving RL training efficiency without compromising learning quality.
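The drafter described above builds a non-parametric model of prompt-level patterns from historical rollouts. As a rough illustration (not the paper's implementation), the sketch below approximates the incremental suffix tree with a longest-suffix n-gram lookup: index every context seen in past rollouts, then draft by greedily following the most frequent continuation of the longest matching suffix. All names and the `max_order` parameter are hypothetical.

```python
from collections import defaultdict

class HistoryDrafter:
    """Non-parametric drafter over historical rollout tokens.

    Hypothetical simplification: the paper's incremental suffix tree is
    approximated by a longest-suffix n-gram table up to `max_order`.
    """

    def __init__(self, max_order=4):
        self.max_order = max_order
        # Maps a context tuple to counts of the tokens that followed it.
        self.next_counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        # Incrementally index every context of length 1..max_order
        # observed in a finished rollout trajectory.
        for i in range(1, len(tokens)):
            for k in range(1, min(self.max_order, i) + 1):
                ctx = tuple(tokens[i - k:i])
                self.next_counts[ctx][tokens[i]] += 1

    def draft(self, prefix, budget):
        # Greedily extend the prefix using the longest matching suffix;
        # stop early when no historical context matches.
        out, seq = [], list(prefix)
        for _ in range(budget):
            nxt = None
            for k in range(min(self.max_order, len(seq)), 0, -1):
                ctx = tuple(seq[-k:])
                if ctx in self.next_counts:
                    nxt = max(self.next_counts[ctx],
                              key=self.next_counts[ctx].get)
                    break
            if nxt is None:
                break
            out.append(nxt)
            seq.append(nxt)
        return out
```

Because the table is count-based and updated in place, the drafter adapts as the policy's rollout distribution shifts across training epochs.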

📝 Abstract
Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck: the long-tail distribution of rollout lengths, in which a small fraction of long generations dominates wall-clock time, and a complementary opportunity: the availability of historical rollouts that reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: an adaptive, non-parametric drafter built from recent rollouts using an incrementally maintained suffix tree, and a length-aware speculation policy that allocates more aggressive draft budgets to the long trajectories that dominate makespan. This design exploits rollout history to sustain high acceptance rates while balancing base- and token-level costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post-training without compromising learning quality.
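The length-aware speculation policy in the abstract allocates larger draft budgets to trajectories that have already run long, since those are the ones likely to dominate makespan. A minimal sketch of such a schedule, with entirely hypothetical constants (`base`, `max_budget`, `scale` are illustrative, not from the paper):

```python
def draft_budget(tokens_generated, base=2, max_budget=16, scale=512):
    """Length-aware draft budget (hypothetical schedule).

    Under a long-tail length distribution, a sequence that has already
    generated many tokens is likely to keep going, so it receives a
    more aggressive speculation budget; short sequences stay cheap.
    """
    bonus = tokens_generated // scale  # one extra draft token per `scale` generated
    return min(base + bonus, max_budget)
```

In practice the budget would also be modulated by the observed acceptance rate, since over-long drafts waste verification compute when acceptance is low.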
Problem

Research questions and friction points this paper is trying to address.

Accelerates the RL training rollout phase, which is bottlenecked by long-tail token generation
Reduces wall-clock time dominated by a small fraction of long trajectories
Maintains identical training quality while speeding up rollouts via speculative decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive, non-parametric drafter built on an incrementally maintained suffix tree
Length-aware speculation policy that allocates larger draft budgets to long trajectories
Distribution-aware speculative decoding that accelerates rollouts without altering model outputs
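The "without altering outputs" property comes from the standard speculative-decoding verification step: the target model checks every drafted token and falls back to its own token at the first mismatch. For the greedy-decoding case this is easy to see; the sketch below (hypothetical interface, where `target_argmax(seq)` returns the target model's next token, in practice computed for all draft positions in one forward pass) produces exactly the tokens plain greedy decoding would, which is why the RL training curve is unchanged.

```python
def verify_greedy(target_argmax, prefix, draft):
    """Greedy speculative verification (sketch).

    Accept the longest draft prefix that matches the target model's
    argmax at each position, then append one token from the target
    model (the correction on mismatch, or a bonus token if the whole
    draft is accepted). The emitted tokens are identical to plain
    greedy decoding, so the output distribution is preserved.
    """
    accepted, seq = [], list(prefix)
    for tok in draft:
        true_tok = target_argmax(seq)
        if tok == true_tok:
            accepted.append(tok)
            seq.append(tok)
        else:
            # First mismatch: emit the target model's token and stop.
            accepted.append(true_tok)
            return accepted
    # Entire draft accepted: bonus token from the target model.
    accepted.append(target_argmax(seq))
    return accepted
```

For sampling-based rollouts the same guarantee holds via the standard accept/reject rule on draft-vs-target probabilities rather than argmax matching.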