Self-Distillation for Multi-Token Prediction

πŸ“… 2026-03-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of multi-token prediction (MTP) in large language models, where low acceptance rates of prediction heads and challenges in jointly training multiple heads hinder effective inference acceleration. To overcome these issues, the authors propose MTP-D, a method that employs a lightweight self-distillation mechanism to enhance the acceptance rate of MTP heads and introduces a cyclic expansion strategy to efficiently scale the number of prediction heads. Evaluated across seven benchmarks, MTP-D demonstrates consistent effectiveness with negligible degradation to the main language model’s performance: it improves MTP head acceptance by 7.5% and accelerates single-head MTP inference by 220.4%, substantially advancing the practical deployment of MTP in real-world applications.
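The cyclic (looped) expansion idea described above — scaling the number of draft positions without training separate heads — can be illustrated with a toy sketch. This is only a plausible reading of the summary, not the paper's actual architecture: the head weights `W_head`, the state-update matrix `W_step`, and the `tanh` transition are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, K = 8, 16, 3  # toy sizes; K future tokens drafted per step

W_head = rng.normal(size=(HIDDEN, VOCAB)) * 0.1   # one shared MTP head (assumed)
W_step = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1  # hypothetical hidden-state update

def looped_mtp(h):
    """Draft K future tokens by looping a single head: the same
    parameters are reapplied to an evolving hidden state, rather
    than training K independent prediction heads."""
    tokens = []
    for _ in range(K):
        logits = h @ W_head              # score the vocabulary at this position
        tokens.append(int(np.argmax(logits)))
        h = np.tanh(h @ W_step)          # advance the state for the next position
    return tokens

draft = looped_mtp(rng.normal(size=HIDDEN))
```

In speculative-decoding terms, the `K` drafted tokens would then be verified by the main model in one forward pass, so a higher acceptance rate translates directly into fewer main-model calls.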

πŸ“ Abstract
As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) can accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. Therefore, we propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head acceptance rates (+7.5%) while maximally preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical MTP head extension and a further significant inference speedup over 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that our MTP-D and looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical usage of MTP in LLMs.
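The self-distillation component can be sketched as a standard soft-target objective: the main head's next-token distribution serves as the teacher for an MTP head predicting the same future position. The abstract does not specify the exact loss, so the KL form, the temperature `T`, and the function names below are assumptions for illustration only.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distill_loss(main_logits, mtp_logits, T=2.0):
    """KL(teacher || student): the frozen main head's distribution
    for a future position (teacher) guides the MTP head's prediction
    for that same position (student). Hypothetical form of the loss."""
    p = softmax(main_logits, T)  # teacher soft targets
    q = softmax(mtp_logits, T)   # student predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(np.mean(kl))
```

Because the teacher signal comes from the model's own main head, the extra training cost is limited to one more forward pass and a loss term, consistent with the "lightweight" framing in the summary.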
Problem

Research questions and friction points this paper is trying to address.

Multi-Token Prediction
inference efficiency
acceptance rate
joint training
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Distillation
Multi-Token Prediction
Inference Acceleration
Looped Extension
Large Language Models
Guoliang Zhao
Large Language Model Department, Tencent
Ruobing Xie
Tencent
Large Language Model, Recommender System, Natural Language Processing
An Wang
Large Language Model Department, Tencent
Shuaipeng Li
Tencent
Huaibing Xie
Large Language Model Department, Tencent
Xingwu Sun
Tencent
Natural Language Processing, Question Answering, Question Generation