Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE

📅 2025-02-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limited candidate token diversity and low acceptance rate in speculative decoding (SD) caused by representational homogeneity in draft models, this paper proposes an Mixture-of-Experts (MoE)-driven decoupled multi-head speculation mechanism: multiple independent experts are activated in parallel within a single step to generate heterogeneous candidates, overcoming the candidate homogenization bottleneck inherent in conventional tree-based sampling. We further design a hybrid decoding strategy that jointly leverages autoregressive and parallel verification, augmented by a contrastive learning–enhanced target model feature verification module. The resulting framework features dynamic, stage-aware switching between speculation modes. Evaluated across diverse LLM scales, our method achieves significant improvements—+12.3% in draft token acceptance rate and up to 2.1× end-to-end inference speedup—establishing a new state-of-the-art for SD. The code is publicly available.

Technology Category

Application Category

📝 Abstract
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which are then verified in parallel by the larger target model. However, the limited capacity of the draft model often necessitates tree-based sampling to improve prediction accuracy, where multiple candidates are generated at each step. We identify a key limitation in this approach: the candidates at the same step are derived from the same representation, limiting diversity and reducing overall effectiveness. To address this, we propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions, effectively decoupling correlations among candidates. Furthermore, we introduce a hybrid inference strategy, combining autoregressive decoding for initial tokens with parallel decoding for subsequent stages, and enhance the latter with contrastive mechanism in features to improve accuracy. Our method significantly boosts prediction accuracy and achieves higher inference speedups. Extensive experiments across diverse models validate the effectiveness and robustness of our approach, establishing a new SOTA in speculative decoding. Our codes are available at https://github.com/haiduo/Jakiro.
Problem

Research questions and friction points this paper is trying to address.

Improves speculative decoding with diverse predictions
Introduces hybrid inference strategy for better accuracy
Enhances speed and accuracy in language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts (MoE)
Hybrid inference strategy
Contrastive mechanism enhancement
🔎 Similar Papers
No similar papers found.
H
Haiduo Huang
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China
F
Fuwei Yang
Advanced Micro Devices, Inc., Beijing, China
Z
Zhenhua Liu
Advanced Micro Devices, Inc., Beijing, China
Yixing Xu
Yixing Xu
AMD
machine learningdeep learning
J
Jinze Li
University of Hong Kong, Hong Kong, China
Y
Yang Liu
Advanced Micro Devices, Inc., Beijing, China
X
Xuanwu Yin
Advanced Micro Devices, Inc., Beijing, China
D
Dong Li
Advanced Micro Devices, Inc., Beijing, China
Pengju Ren
Pengju Ren
Professor, Xi'an Jiaotong University
Emad Barsoum
Emad Barsoum
AMD, Columbia University
Generative AIFoundation ModelsAgentic AIComputer VisionML Frameworks