AI Summary
Inference-intensive large language models (e.g., OpenAI-o3, DeepSeek-R1) suffer from low efficiency in chain-of-thought speculative decoding. Method: This work establishes, for the first time, three-dimensional log-linear scaling laws governing pretraining data volume, draft model capacity, and batch size (Theorems 1.1-1.3), and constructs the first multi-dimensional scalability theory framework tailored to speculative decoding efficiency. Based on these laws, we propose Scylla, a dense-architecture draft-verify parallel decoder that jointly optimizes model capacity, data scale, and batch configuration. Results: At temperature T=0, Scylla achieves 1.5-2.2x higher acceptance rates than EAGLE2 and outperforms EAGLE3; it attains state-of-the-art performance on summarization and question-answering tasks; and it doubles the throughput of EAGLE2 on industrial inference engines.
Abstract
The escalating demand for efficient decoding in large language models (LLMs) is particularly critical for reasoning-intensive architectures like OpenAI-o3 and DeepSeek-R1, which depend on extended chain-of-thought reasoning. This study investigates speculative decoding techniques through dense LLM architectures to establish foundational insights for accelerating reasoning tasks. While speculative decoding methods leveraging parallel draft-verification cycles have emerged as promising acceleration techniques, the scaling laws governing decoding efficiency remain under-explored compared to conventional backbone LLMs developed through Pretraining->SFT->RLHF training paradigms. In this work, we discover log-linear scaling laws (Theorems 1.1, 1.2, and 1.3) governing draft model acceptance rate (and hence decoding speed) across three dimensions: pretraining token volume, draft model capacity, and decoding batch size. Building on these laws, we develop Scylla, which coordinates multi-dimensional scaling for popular LLMs (Llama2/3, Qwen2.5). Empirical validation shows that Scylla achieves a 1.5-2.2x higher acceptance rate than EAGLE2 and a 0.3 higher acceptance rate than EAGLE3 at temperature T = 0, with peak performance gains on summarization and QA tasks (Figure 2). Industrial inference engine deployments demonstrate 2X decoding throughput improvements over EAGLE2 (Table 5), validating the transformative potential of systematic scaling for efficient LLM inference. Code will be released later.
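To make the log-linear form concrete: a scaling law of this kind predicts that acceptance rate grows linearly in the logarithm of a scaled quantity (e.g., pretraining token volume). The sketch below is purely illustrative, not the paper's method: the coefficients and token counts are invented, and it simply shows how such a law can be fit and recovered by least squares on log-scaled data.

```python
import numpy as np

# Hypothetical log-linear scaling law: acceptance rate = a * log10(tokens) + b.
# Coefficients a_true, b_true and the token grid are made-up values for
# illustration only; they are not results from the paper.
a_true, b_true = 0.05, 0.30
tokens = np.array([1e9, 1e10, 1e11, 1e12])      # pretraining token volumes
accept = a_true * np.log10(tokens) + b_true     # synthetic acceptance rates

# Recover the law by ordinary least squares in log-space.
slope, intercept = np.polyfit(np.log10(tokens), accept, 1)
print(round(slope, 3), round(intercept, 3))  # -> 0.05 0.3
```

In practice one would fit such a curve to measured acceptance rates at several data/model/batch scales and then extrapolate along each of the three dimensions.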