Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding

📅 2025-09-26
🤖 AI Summary
Existing speculative decoding methods suffer from a misalignment between their training objective (single-path greedy draft generation) and their inference strategy (multi-branch tree-based verification), limiting achievable speedup. This paper proposes Group Tree Optimization (GTO), the first framework to align training and decoding via a unified tree-structured policy. GTO introduces a Draft Tree Reward that models the expected accepted-token count over the entire draft tree, and combines grouped policy training, debiased normalized advantage estimation, and sequence-level PPO updates, contrasting against a frozen reference draft model to ensure optimization stability. The authors theoretically establish that GTO's objective is positively correlated with inference acceleration. Experiments across multiple LLMs and tasks demonstrate that GTO improves average acceptance length by 7.4% and achieves an additional 7.7% latency reduction over EAGLE-3, setting a new state of the art for speculative decoding.

📝 Abstract
Speculative decoding accelerates large language model (LLM) inference by letting a lightweight draft model propose multiple tokens that the target model verifies in parallel. Yet existing training objectives optimize only a single greedy draft path, while decoding follows a tree policy that re-ranks and verifies multiple branches. This draft policy misalignment limits achievable speedups. We introduce Group Tree Optimization (GTO), which aligns training with the decoding-time tree policy through two components: (i) Draft Tree Reward, a sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance; (ii) Group-based Draft Policy Training, a stable optimization scheme that contrasts trees from the current and a frozen reference draft model, forming debiased group-standardized advantages and applying a PPO-style surrogate along the longest accepted sequence for robust updates. We further prove that increasing our Draft Tree Reward provably improves acceptance length and speedup. Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K), and multiple LLMs (e.g., LLaMA-3.1-8B, LLaMA-3.3-70B, Vicuna-1.3-13B, DeepSeek-R1-Distill-LLaMA-8B), GTO increases acceptance length by 7.4% and yields an additional 7.7% speedup over prior state-of-the-art EAGLE-3. By bridging draft policy misalignment, GTO offers a practical, general solution for efficient LLM inference.
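The Draft Tree Reward described above is the expected acceptance length of the draft tree under the target model. As a rough illustration of that quantity (not the paper's actual computation), the sketch below assumes each draft token carries an independent acceptance probability, so a node's token is accepted only if all its ancestors are; the `DraftNode` type and probabilities are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class DraftNode:
    # Hypothetical node: probability the target model accepts this draft token
    accept_prob: float
    children: list = field(default_factory=list)

def expected_accepted_tokens(node: DraftNode, path_prob: float = 1.0) -> float:
    """Expected number of accepted draft tokens in the subtree at `node`.

    A token is accepted only if it and every ancestor are accepted, so
    each node contributes the product of acceptance probabilities along
    its path from the root of the draft tree.
    """
    p = path_prob * node.accept_prob
    return p + sum(expected_accepted_tokens(child, p) for child in node.children)

# A small draft tree: one high-confidence token with two alternative branches
tree = DraftNode(0.8, [DraftNode(0.5), DraftNode(0.5, [DraftNode(0.5)])])
print(expected_accepted_tokens(tree))  # 0.8 + 0.4 + 0.4 + 0.2 = 1.8
```

In the paper's formulation this reward is sampling-free and computed under the target model; the independence assumption here is purely for illustration.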
Problem

Research questions and friction points this paper is trying to address.

Draft model training (single-path greedy generation) is misaligned with the tree-based decoding policy
Existing objectives do not optimize the expected token acceptance length of the draft tree
This misalignment caps achievable speedups across domains and model sizes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns training with decoding-time tree policy
Uses Draft Tree Reward for expected acceptance length
Employs group-based draft policy training for stability
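The group-based training the abstract describes contrasts trees from the current and a frozen reference draft model, standardizes rewards within the group, and applies a PPO-style clipped surrogate at the sequence level. A minimal sketch of those two pieces, with all function names and the scalar log-probability interface assumed for illustration:

```python
import math

def group_standardized_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Debias rewards against the group mean, then scale by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def ppo_clip_surrogate(logp_new: float, logp_old: float,
                       advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate on a sequence-level log-prob ratio.

    `logp_new`/`logp_old` would be summed log-probs of the longest
    accepted sequence under the current and frozen reference drafters.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped * advantage)

# Rewards for a group of draft trees -> zero-mean, unit-scale advantages
adv = group_standardized_advantages([1.0, 2.0, 3.0])
print(ppo_clip_surrogate(0.0, 0.0, adv[2]))  # ratio 1.0: surrogate equals advantage
```

The clipping keeps updates conservative when the current drafter drifts far from the frozen reference, which is the stability property the Innovation bullets refer to.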