FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant degradation of speculative sampling acceleration in large-vocabulary language models (e.g., Llama-3-8B) as vocabulary size increases, this paper proposes a frequency-ranked speculative sampling framework. The core innovation is the first integration of word-frequency priors into speculative sampling: a vocabulary-space compression strategy constructs a high-frequency token subset that constrains draft generation, coupled with a lightweight single-layer draft model and a draft-then-verify mechanism. Crucially, the framework preserves strict output-distribution equivalence with theoretical guarantees while substantially reducing LM Head computation. Experiments across multiple datasets demonstrate a 75% reduction in LM Head FLOPs and an average 1.12× end-to-end speedup over the state-of-the-art EAGLE-2.

📝 Abstract
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2.
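The vocabulary-space compression the abstract describes can be illustrated with a minimal sketch (not the authors' implementation; the sizes, random weights, and frequency table below are illustrative assumptions): the draft model's LM Head is restricted to the top-frequency quarter of the vocabulary, cutting its multiply-adds by 75%, while drafted token ids are mapped back to the full vocabulary so the target model's verification step is unaffected.

```python
import numpy as np

# Illustrative sizes (hypothetical): a Llama-3-scale vocabulary, a toy
# hidden dimension, and a subset of 25% of tokens -> 75% fewer LM Head
# multiply-adds for the draft model.
VOCAB_SIZE = 128_000
HIDDEN = 64
SUBSET = 32_000

rng = np.random.default_rng(0)
lm_head = rng.standard_normal((VOCAB_SIZE, HIDDEN))  # full LM Head weights
token_freq = rng.random(VOCAB_SIZE)                  # assumed corpus frequencies

# Frequency-ranked compression: keep only the SUBSET most frequent tokens.
freq_ids = np.argsort(token_freq)[::-1][:SUBSET]
sub_head = lm_head[freq_ids]                         # (SUBSET, HIDDEN) draft head

hidden = rng.standard_normal(HIDDEN)                 # draft model hidden state

# Draft step: score only the high-frequency subset, then map the chosen
# token back to its full-vocabulary id.
sub_logits = sub_head @ hidden
draft_token = int(freq_ids[np.argmax(sub_logits)])

# Verification (sketch): the target model still scores the full vocabulary,
# so speculative sampling's accept/reject step leaves the final output
# distribution unchanged. Subset logits agree exactly with the full head's
# logits for those tokens.
full_logits = lm_head @ hidden
assert np.allclose(full_logits[freq_ids], sub_logits)
```

Because the compressed head's rows are copied unchanged from the full LM Head, drafting over the subset only narrows which tokens can be proposed; any accepted token is scored identically by the verifier, which is what preserves the output distribution.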
Problem

Research questions and friction points this paper is trying to address.

Accelerating large-vocabulary language models
Optimizing draft candidate selection
Reducing LM Head computation overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frequency-ranked speculative sampling
Vocabulary space compression
Draft search optimization
Weilin Zhao
Tsinghua University
Natural Language Processing · Artificial Intelligence · Efficient LLM
Tengyu Pan
Tsinghua University, Beijing, China
Xu Han
Tsinghua University, Beijing, China
Yudi Zhang
Harbin Institute of Technology, Harbin, China
Ao Sun
Beijing University of Posts and Telecommunications, Beijing, China
Yuxiang Huang
Tsinghua University
Efficient AI · Natural Language Processing · Machine Learning System
Kaihuo Zhang
OpenBMB
Weilun Zhao
OpenBMB
Yuxuan Li
Tsinghua University, Beijing, China
Jianyong Wang
Tsinghua University, Beijing, China
Zhiyuan Liu
Tsinghua University, Beijing, China
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language Processing · Artificial Intelligence · Social Computing