SRT: Accelerating Reinforcement Learning via Speculative Rollout with Tree-Structured Cache

📅 2026-01-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational cost and latency of language model inference in on-policy reinforcement learning, which severely hampers training efficiency. To mitigate this, the authors propose a per-prompt tree-structured caching mechanism that organizes previously generated outputs into a prefix tree and enables efficient draft sampling during rollouts via speculative decoding. The approach further incorporates runtime cache updates and an idle-period lookahead pre-generation strategy to improve cache coverage and reuse. Crucially, it preserves the correctness of the policy distribution while integrating seamlessly with mainstream RL algorithms such as PPO, GRPO, and DAPO. Evaluated on multi-turn dialogue and standard RL tasks, the method achieves up to 2.08× end-to-end training speedup, substantially reducing per-token inference cost and step latency.
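The summary above describes a per-prompt prefix tree over past rollouts that serves as the draft source for speculative decoding. A minimal sketch of such a cache is below; the names (`TrieNode`, `SRTCache`) and the most-frequent-path draft heuristic are illustrative assumptions, not details from the paper.

```python
from collections import defaultdict

class TrieNode:
    """One node of the prefix tree; edges are tokens."""
    def __init__(self):
        self.children = {}  # token -> TrieNode
        self.count = 0      # how often this continuation was observed

class SRTCache:
    """Per-prompt prefix tree over previously generated token sequences."""
    def __init__(self):
        self.roots = defaultdict(TrieNode)  # prompt -> tree root

    def insert(self, prompt, tokens):
        """Runtime cache update: fold a finished rollout into the tree."""
        node = self.roots[prompt]
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())
            node.count += 1

    def draft(self, prompt, prefix, max_len):
        """Propose a draft continuation by following the most frequent
        cached path after `prefix` (illustrative heuristic)."""
        node = self.roots.get(prompt)
        if node is None:
            return []
        for t in prefix:
            node = node.children.get(t)
            if node is None:
                return []  # prefix not cached; no draft available
        out = []
        while node.children and len(out) < max_len:
            t, node = max(node.children.items(), key=lambda kv: kv[1].count)
            out.append(t)
        return out
```

Drafts proposed this way are cheap to produce (no model forward pass) and are then verified by the current policy, which is what keeps the rollout distribution unchanged.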

📝 Abstract
We present Speculative Rollout with Tree-Structured Cache (SRT), a simple, model-free approach to accelerating on-policy reinforcement learning (RL) for language models without sacrificing distributional correctness. SRT exploits the empirical similarity of rollouts for the same prompt across training steps by storing previously generated continuations in a per-prompt tree-structured cache. During generation, the current policy uses this tree as a draft model for speculative decoding. To keep the cache fresh and improve draft quality, SRT updates trees online from ongoing rollouts and proactively performs run-ahead generation during idle GPU bubbles. Integrated into standard RL pipelines (e.g., PPO, GRPO, and DAPO) and multi-turn settings, SRT consistently reduces generation and step latency and lowers per-token inference cost, achieving up to 2.08x wall-clock speedup during rollout.
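The abstract's claim that SRT preserves the policy distribution rests on standard speculative-decoding verification: each cached draft token is accepted or rejected against the current policy's probabilities. A sketch of token-level verification for a deterministic draft (the cached path puts all proposal mass on one token) is below; the function name and list-based probability representation are assumptions for illustration.

```python
import random

def verify_draft(target_probs, draft_tokens, rng):
    """Speculative verification against a deterministic draft.

    target_probs[i] is the current policy's distribution (a list of
    probabilities over the vocabulary) at position i, conditioned on the
    prefix plus draft_tokens[:i]. Accepting draft token t with
    probability p[t], and resampling from p with t removed on rejection,
    recovers exact samples from the target policy.
    """
    accepted = []
    for i, t in enumerate(draft_tokens):
        p = target_probs[i]
        if rng.random() < p[t]:
            accepted.append(t)  # draft token kept
            continue
        # Rejected: resample from the residual distribution p(. | x != t).
        residual = [(tok, pr) for tok, pr in enumerate(p) if tok != t]
        total = sum(pr for _, pr in residual)
        r = rng.random() * total
        acc = 0.0
        for tok, pr in residual:
            acc += pr
            if r <= acc:
                accepted.append(tok)
                break
        break  # stop at the first rejection
    return accepted
```

In the RL rollout loop, all positions of the draft can be scored in a single batched forward pass, which is where the per-token cost reduction comes from.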
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
language models
rollout acceleration
on-policy RL
inference latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
Tree-Structured Cache
On-Policy Reinforcement Learning
Run-Ahead Generation
Language Model Acceleration