ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three critical challenges in applying speculative decoding (SD) to reinforcement learning (RL): (1) diminishing acceleration under large-batch training, (2) policy degradation due to lagging draft models, and (3) training instability. We propose a triple-cooperative optimization framework: (1) dynamic decoding configuration guided by real-time computational load and rollout quality; (2) online draft model updating via knowledge distillation, where the target policy serves as the teacher and rollouts are weighted by their reward estimates; and (3) reward-aware gradient weighting to mitigate policy divergence. Evaluated on Qwen models ranging from 3B to 14B parameters, our method achieves up to 4.5× inference speedup while preserving reward convergence and training stability. To the best of our knowledge, this is the first systematic solution enabling SD to robustly support iterative policy optimization scenarios such as RLHF.
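The second mechanism, online drafter updating with reward-weighted distillation, can be sketched in minimal form. The sequence-level KL objective and the softmax normalization of reward estimates below are illustrative assumptions; the paper's exact loss is not reproduced here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL(p || q): teacher (target policy) distribution p vs. drafter q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reward_weighted_kd_loss(teacher_logits, drafter_logits, rewards):
    """Distillation loss over a batch of rollouts, where each rollout's
    per-token KL term is weighted by its (softmax-normalized) reward
    estimate, so higher-reward rollouts steer the drafter more strongly."""
    weights = softmax(rewards)
    loss = 0.0
    for w, t_seq, d_seq in zip(weights, teacher_logits, drafter_logits):
        seq_kl = sum(kl(softmax(t), softmax(d)) for t, d in zip(t_seq, d_seq))
        loss += w * (seq_kl / max(len(t_seq), 1))
    return loss
```

In practice the teacher distributions come for free from the rollout phase (the target policy already scored every generated token), so this update adds no extra target-model forward passes.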

📝 Abstract
Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75% of the training time. Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the naive integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation. To address these gaps, we present ReSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B--14B), ReSpec achieves up to 4.5x speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.
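For context, the speculative decoding loop that ReSpec builds on can be sketched as follows. This is a simplified greedy-verification variant (exact SD verifies drafts against the target distribution with stochastic acceptance); `drafter_sample` and `target_argmax` are hypothetical callables standing in for the draft and target models.

```python
def speculative_step(drafter_sample, target_argmax, prefix, gamma=4):
    """One speculative decoding step, greedy-verification simplification.
    The drafter proposes `gamma` tokens; the target model checks them and
    keeps the longest matching prefix, emitting its own token at the first
    mismatch (or a bonus token when every draft token is accepted)."""
    # Draft phase: cheap autoregressive proposals from the small model.
    draft, ctx = [], list(prefix)
    for _ in range(gamma):
        tok = drafter_sample(ctx)
        draft.append(tok)
        ctx.append(tok)
    # Verify phase: in a real system this is a single batched target
    # forward pass over all draft positions.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        target_tok = target_argmax(ctx)
        if target_tok == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_tok)  # correction token
            return accepted
    accepted.append(target_argmax(ctx))  # bonus token: all drafts accepted
    return accepted
```

The speedup comes from verifying `gamma` draft tokens in one target-model pass; when the drafter drifts from the actor (the staleness gap above), acceptance drops and the draft compute is wasted.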
Problem

Research questions and friction points this paper is trying to address.

Diminishing speculative decoding speedups at the large batch sizes typical of RL rollout generation
Drafter staleness and drafter-induced policy degradation under continual actor updates
Preserving reward convergence and training stability while accelerating LLM adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamically tuning speculative decoding configurations based on computational load and rollout quality
Evolving the drafter online via knowledge distillation from the target policy
Weighting drafter updates by rollout reward signals
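The first mechanism, dynamic configuration tuning, can be sketched as a simple feedback controller over the speculation length. The thresholds and the single-knob policy below are illustrative assumptions, not the paper's actual controller.

```python
def tune_speculation_length(gamma, acceptance_rate, batch_size,
                            max_gamma=8, large_batch=64):
    """Adapt the number of drafted tokens per step. Shrink speculation
    when batches are large (drafting competes with target-model compute,
    eroding SD's speedup) or when few drafted tokens are accepted; grow
    it when acceptance is high and there is compute headroom."""
    if batch_size >= large_batch or acceptance_rate < 0.5:
        return max(1, gamma - 1)  # gamma=1 degenerates toward plain decoding
    if acceptance_rate > 0.8:
        return min(max_gamma, gamma + 1)
    return gamma
```

Re-evaluating this per rollout batch lets the system back off to near-vanilla decoding when SD stops paying for itself, rather than committing to one static configuration.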
👥 Authors
Qiaoling Chen — Nanyang Technological University
Zijun Liu — Tsinghua University (LLM Agent, Machine Translation, AIGC)
Peng Sun — Shanghai Qiji Zhifeng Co., Ltd., Shanghai, China
Shenggui Li — Nanyang Technological University (HPC, Machine Learning, Computer System)
Guoteng Wang — Shanghai Qiji Zhifeng Co., Ltd., Shanghai, China
Ziming Liu — National University of Singapore, Singapore
Yonggang Wen — FIEEE, FSAEng, Professor & President's Chair, Nanyang Technological University, Singapore (Data Center, Digital Twin, Multimedia Computing, Green Computing)
Siyuan Feng — Shanghai Innovation Institute, Shanghai, China
Tianwei Zhang — Nanyang Technological University, Singapore