🤖 AI Summary
This work addresses three critical challenges in applying speculative decoding (SD) to reinforcement learning (RL) training: (1) diminishing acceleration at large batch sizes, (2) drafter staleness under continual policy updates, and (3) drafter-induced policy degradation and training instability. We propose a framework of three cooperating mechanisms: (1) dynamic tuning of the SD configuration guided by real-time computational load and rollout quality; (2) online draft-model updating via knowledge distillation, where the target policy serves as the teacher and rollouts are weighted by their reward estimates; and (3) reward-aware gradient weighting to mitigate policy divergence. Evaluated on Qwen models ranging from 3B to 14B parameters, our method achieves up to 4.5× inference speedup while preserving reward convergence and training stability. To the best of our knowledge, this is the first systematic solution enabling SD to robustly support iterative policy-optimization settings such as RLHF.
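The reward-weighted distillation mechanism can be sketched as follows. This is a minimal, hypothetical illustration of the idea described above (target policy as teacher, rollouts weighted by reward estimates), not ReSpec's actual objective; the function name and the softmax reward weighting are assumptions for illustration.

```python
import math

def _softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reward_weighted_distill_loss(teacher_logits, draft_logits, rewards):
    """Per-rollout KL(teacher || draft), combined with softmax(reward)
    weights so high-reward rollouts dominate the draft update.
    Hypothetical sketch, not the paper's exact loss."""
    # Turn reward estimates into non-negative weights summing to 1.
    mr = max(rewards)
    wexp = [math.exp(r - mr) for r in rewards]
    ws = sum(wexp)
    weights = [w / ws for w in wexp]

    total = 0.0
    for w, t_log, d_log in zip(weights, teacher_logits, draft_logits):
        p = _softmax(t_log)  # target-policy (teacher) distribution
        q = _softmax(d_log)  # draft-model distribution
        kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
        total += w * kl
    return total
```

When the draft already matches the teacher, the loss is zero; as the actor policy drifts away from the drafter during RL training, the loss grows and drives the online draft update.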
📝 Abstract
Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75% of the training time. Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the naive integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation.
To address these gaps, we present ReSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B–14B), ReSpec achieves up to 4.5× speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.