AI Summary
This work addresses the limitations of multi-token prediction (MTP) in large language models, where low acceptance rates of prediction heads and the difficulty of jointly training multiple heads hinder effective inference acceleration. To overcome these issues, the authors propose MTP-D, a method that employs a lightweight self-distillation mechanism to raise the acceptance rate of MTP heads, together with a looped extension strategy that scales the number of prediction heads economically. Evaluated across seven benchmarks, MTP-D demonstrates consistent effectiveness with negligible degradation of the main language model's performance: it improves MTP head acceptance by 7.5% and speeds up inference by 220.4% relative to single-head MTP, substantially advancing the practical deployment of MTP in real-world applications.
Abstract
As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) can accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. We therefore propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head acceptance rates (+7.5%) while maximally preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical MTP head extension and a further significant inference speedup over 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that our MTP-D and looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical use of MTP in LLMs.
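The abstract does not spell out the distillation objective, but self-distillation for MTP heads is commonly formulated as a KL divergence that pulls each head's next-token distribution toward that of the (frozen) main head acting as teacher. The sketch below is a minimal NumPy illustration of that idea under this assumption; the function names, the temperature value, and the toy shapes are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis, with the usual
    max-subtraction for numerical stability."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distillation_loss(head_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student), averaged over positions: pushes the MTP
    head's distribution toward the frozen main head's distribution."""
    p = softmax(teacher_logits, temperature)  # teacher: main LM head
    q = softmax(head_logits, temperature)     # student: an MTP head
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())

# Toy example: a vocabulary of 5 tokens at 2 sequence positions.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 5))
student = rng.normal(size=(2, 5))
loss = self_distillation_loss(student, teacher)
```

The loss is zero when the head already matches the teacher and strictly positive otherwise, so minimizing it (typically alongside the standard cross-entropy term) aligns the MTP heads with the main head, which is one plausible route to the higher acceptance rates the paper reports.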