When RL Meets Adaptive Speculative Training: A Unified Training-Serving System

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses limitations of conventional speculative decoding that stem from separating speculator training from serving: long time-to-serve, delayed utility feedback, and performance degradation under domain drift. The authors propose Aurora, a system that formulates draft-model training as an asynchronous reinforcement learning problem, continuously optimizing the speculator online from real-time inference trajectories. By unifying training and serving into a closed-loop pipeline, Aurora couples an SGLang-based inference server with an asynchronous training module and uses token acceptance and rejection as positive and negative reward signals for policy updates; the design supports day-0 deployment and hot updates without service interruption. Evaluated on large models including MiniMax M2.1 229B, Aurora achieves a 1.5× speedup at day-0 deployment and an additional 1.25× acceleration over a static speculative model under distribution shift, substantially improving adaptability and sample efficiency.
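The accept/reject reward signal the summary describes can be sketched as a toy REINFORCE loop: accepted draft tokens earn positive reward, rejected proposals earn negative reward, and the draft policy is nudged accordingly. Everything below (the tiny vocabulary, `draft_logits`, `policy_update`, the stand-in verifier) is illustrative and is not Aurora's actual interface.

```python
import math
import random

# Toy categorical "draft policy" over a tiny vocabulary, updated with
# REINFORCE-style gradients from accept/reject feedback (illustrative only).
VOCAB = ["a", "b", "c", "d"]
draft_logits = {tok: 0.0 for tok in VOCAB}

def draft_probs():
    """Softmax over the draft logits."""
    z = max(draft_logits.values())
    exps = {t: math.exp(l - z) for t, l in draft_logits.items()}
    s = sum(exps.values())
    return {t: e / s for t, e in exps.items()}

def policy_update(token, reward, lr=0.5):
    """One REINFORCE step: grad of log pi(token) is 1[t == token] - pi(t)."""
    probs = draft_probs()
    for t in VOCAB:
        grad = (1.0 if t == token else 0.0) - probs[t]
        draft_logits[t] += lr * reward * grad

# Stand-in for target-model verification: it always accepts "a".
random.seed(0)
for _ in range(200):
    probs = draft_probs()
    token = random.choices(VOCAB, weights=[probs[t] for t in VOCAB])[0]
    reward = 1.0 if token == "a" else -1.0  # accept => positive, reject => negative
    policy_update(token, reward)

# After training, the draft policy should concentrate on the accepted token.
```

The key point mirrored here is sample efficiency: rejected proposals are not discarded but reused as negative feedback, so every speculated token contributes a gradient signal.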

📝 Abstract
Speculative decoding can significantly accelerate LLM serving, yet most deployments today disentangle speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3).
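The closed-loop, hot-swap design the abstract describes can be sketched with two threads sharing an atomic model reference: the serving thread emits live (token, accepted) traces, and the trainer consumes them and periodically publishes an updated speculator without pausing serving. `DraftModel`, `trace_queue`, and the version counter below are hypothetical stand-ins, not Aurora's implementation.

```python
import threading
import queue

class DraftModel:
    """Placeholder for a speculator checkpoint, identified by version."""
    def __init__(self, version):
        self.version = version

trace_queue = queue.Queue()       # live inference traces flow serving -> trainer
current_draft = DraftModel(version=0)
lock = threading.Lock()

def serve(n_requests):
    """Serving loop: always reads the latest hot-swapped speculator."""
    for i in range(n_requests):
        with lock:
            draft = current_draft
        accepted = (i % 3 != 0)   # stand-in for target-model verification
        trace_queue.put((i, accepted, draft.version))

def train(n_updates, batch=10):
    """Trainer: consume trace batches, then atomically publish a new speculator."""
    global current_draft
    for v in range(1, n_updates + 1):
        for _ in range(batch):
            trace_queue.get()     # blocks until live traces arrive
        with lock:                # hot swap: no service interruption
            current_draft = DraftModel(version=v)

server = threading.Thread(target=serve, args=(30,))
trainer = threading.Thread(target=train, args=(3,))
server.start(); trainer.start()
server.join(); trainer.join()
```

This also illustrates day-0 deployment: version 0 is served immediately and improved in place, rather than waiting for an offline training run to finish before the speculator goes live.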
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
deployment lag
domain drift
LLM serving
training-serving decoupling
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
reinforcement learning
unified training-serving
online adaptation
asynchronous learning