FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the throughput bottleneck caused by autoregressive inference in large language models (LLMs), this paper proposes FastMTP, a method that adapts Multi-Token Prediction (MTP) for inference acceleration through speculative decoding. The method has three components: (1) a single MTP head with position-shared weights, trained in a way that matches its recursive use at inference time, which reduces parameter redundancy; (2) language-aware dynamic vocabulary compression, which reduces the computational overhead of drafting; and (3) lightweight fine-tuning on self-distilled data, which helps the head capture dependencies among consecutive future tokens and keeps acceptance rates high across recursive draft steps. Across seven benchmarks, FastMTP achieves an average 2.03× speedup over standard next-token prediction with lossless output quality, outperforming vanilla MTP by 82%, while requiring only lightweight training and integrating with existing inference frameworks.

📝 Abstract
As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits practical deployment. While Multi-Token Prediction (MTP) has demonstrated remarkable benefits for model training efficiency and performance, its inherent potential for inference acceleration remains largely unexplored. This paper introduces FastMTP, a simple yet effective method that improves multi-step draft quality by aligning MTP training with its inference pattern, significantly enhancing speculative decoding performance. Our approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens and maintain high acceptance rates across multiple recursive draft steps. By integrating language-aware dynamic vocabulary compression into the MTP head, we further reduce computational overhead in the drafting process. Experimental results across seven diverse benchmarks demonstrate that FastMTP achieves an average 2.03x speedup over standard next-token prediction with lossless output quality, outperforming vanilla MTP by 82%. FastMTP requires only lightweight training and integrates seamlessly with existing inference frameworks, offering a practical and rapidly deployable solution for accelerating LLM inference.
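The draft-then-verify loop that the abstract relies on can be sketched as follows. This is a minimal toy illustration of greedy speculative decoding with a single draft function applied recursively, not the authors' implementation. The names `speculative_step`, `speculative_generate`, `greedy_generate`, `toy_target`, and `toy_draft` are hypothetical, and both "models" are small arithmetic stand-ins. A real MTP head would also consume hidden states from the main model, and verification would run as one batched forward pass rather than a Python loop.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], target_next: NextFn,
                     draft_next: NextFn, num_draft: int) -> List[Token]:
    """Run one draft-then-verify step and return the accepted tokens."""
    # Draft phase: the same draft function is reused recursively,
    # with each drafted token fed back in as context for the next one.
    drafts: List[Token] = []
    draft_ctx = list(context)
    for _ in range(num_draft):
        tok = draft_next(draft_ctx)
        drafts.append(tok)
        draft_ctx.append(tok)

    # Verify phase: keep drafted tokens while they match the target's
    # greedy choice; on the first mismatch, emit the target's token.
    accepted: List[Token] = []
    verify_ctx = list(context)
    for tok in drafts:
        expected = target_next(verify_ctx)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        verify_ctx.append(tok)

    # Every draft was accepted, so the target contributes one extra token.
    accepted.append(target_next(verify_ctx))
    return accepted


def speculative_generate(prompt: List[Token], target_next: NextFn,
                         draft_next: NextFn, num_draft: int,
                         max_new_tokens: int) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, target_next, draft_next, num_draft))
    return out[: len(prompt) + max_new_tokens]


def greedy_generate(prompt: List[Token], target_next: NextFn,
                    max_new_tokens: int) -> List[Token]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical stand-in for the main model's greedy next token.
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical draft head: agrees with the target except after token 6.
    return 0 if ctx[-1] == 6 else toy_target(ctx)


if __name__ == "__main__":
    print(speculative_generate([1], toy_target, toy_draft, 3, 10))
    print(greedy_generate([1], toy_target, 10))
```

Because every emitted token is either a draft that the target confirmed or the target's own choice, the speculative output matches plain greedy decoding token for token. In this sketch, draft quality affects only how many tokens each step accepts, which is why higher acceptance rates across recursive draft steps translate into speedup.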
Problem

Research questions and friction points this paper is trying to address.

Removing the LLM inference bottleneck caused by sequential autoregressive generation
Improving the multi-step draft quality of multi-token prediction to boost speculative decoding performance
Reducing computational overhead while maintaining lossless output quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes a single MTP head with position-shared weights
Integrates language-aware dynamic vocabulary compression into the MTP head to reduce drafting overhead
Uses self-distilled data to improve draft quality
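
One way to picture the vocabulary compression idea is to score only a language-specific subset of the output vocabulary during drafting. The sketch below is an assumption-laden illustration, not the paper's mechanism: the source does not say how the language is detected or how the subset is chosen, so `pick_vocab`, its CJK-codepoint heuristic, `vocab_by_lang`, and `compressed_argmax` are all hypothetical.

```python
from typing import Dict, List, Sequence


def pick_vocab(context_text: str, vocab_by_lang: Dict[str, List[int]]) -> List[int]:
    """Choose the token-id subset for the context's language (toy heuristic)."""
    # Assumption: any CJK Unified Ideograph in the context means "zh", else "en".
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in context_text)
    return vocab_by_lang["zh" if has_cjk else "en"]


def compressed_argmax(hidden: Sequence[float],
                      lm_head_rows: Sequence[Sequence[float]],
                      keep_ids: Sequence[int]) -> int:
    """Score only the rows listed in keep_ids; return a full-vocabulary token id."""
    best_id, best_score = keep_ids[0], float("-inf")
    for tok_id in keep_ids:
        # Dot product between the hidden state and one output-projection row.
        score = sum(h * w for h, w in zip(hidden, lm_head_rows[tok_id]))
        if score > best_score:
            best_id, best_score = tok_id, score
    return best_id


# Toy output projection with a 4-token vocabulary and 2-dimensional hidden states.
rows = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
vocab_by_lang = {"en": [0, 1, 3], "zh": [2, 3]}

en_choice = compressed_argmax([1.0, 1.0], rows, pick_vocab("hello", vocab_by_lang))
zh_choice = compressed_argmax([1.0, 1.0], rows, pick_vocab("你好", vocab_by_lang))
```

The draft head then computes only `len(keep_ids)` scores instead of one per vocabulary entry. Since the main model still verifies each drafted token against the full vocabulary, a token missing from the subset can lower the acceptance rate but cannot change the final output.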
Authors
Yuxuan Cai (Tencent)
Xiaozhuan Liang (Tencent)
Xinghua Wang (Tencent)
Jin Ma (Tencent)
Haijin Liang (Tencent)
Jinwen Luo (Tencent)
Xinyu Zuo (Tencent)
Lisheng Duan (Tencent)
Yuyang Yin (Beijing Jiaotong University; Computer Vision, AIGC)
Xi Chen (Tencent)