🤖 AI Summary
This work addresses three critical challenges in applying speculative decoding (SD) to reinforcement learning (RL) training: (1) diminishing acceleration at large batch sizes, (2) drafter staleness under continual policy updates, and (3) drafter-induced policy degradation and training instability. We propose a framework of three cooperating mechanisms: (1) dynamic tuning of the SD configuration guided by real-time computational load and rollout quality; (2) online draft-model updating via knowledge distillation, where the target policy serves as the teacher and rollouts are weighted by their reward estimates; and (3) reward-aware gradient weighting to mitigate policy divergence. Evaluated on Qwen models ranging from 3B to 14B parameters, our method achieves up to 4.5× inference speedup while preserving reward convergence and training stability. To the best of our knowledge, this is the first systematic solution enabling SD to robustly support iterative policy-optimization settings such as RLHF.
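The reward-weighted distillation mechanism can be sketched as follows. This is a minimal, hypothetical illustration of the idea described above (target policy as teacher, rollouts weighted by reward estimates), not ReSpec's actual objective; the function name and the softmax reward weighting are assumptions for illustration.

```python
import math

def _softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reward_weighted_distill_loss(teacher_logits, draft_logits, rewards):
    """Per-rollout KL(teacher || draft), combined with softmax(reward)
    weights so high-reward rollouts dominate the draft update.
    Hypothetical sketch, not the paper's exact loss."""
    # Turn reward estimates into non-negative weights summing to 1.
    mr = max(rewards)
    wexp = [math.exp(r - mr) for r in rewards]
    ws = sum(wexp)
    weights = [w / ws for w in wexp]

    total = 0.0
    for w, t_log, d_log in zip(weights, teacher_logits, draft_logits):
        p = _softmax(t_log)  # target-policy (teacher) distribution
        q = _softmax(d_log)  # draft-model distribution
        kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
        total += w * kl
    return total
```

When the draft already matches the teacher, the loss is zero; as the actor policy drifts away from the drafter during RL training, the loss grows and drives the online draft update.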
📝 Abstract
Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75% of the training time. Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the naive integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation.
To address these gaps, we present ReSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B–14B), ReSpec achieves up to 4.5× speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.