🤖 AI Summary
This work proposes an online self-distillation method for accelerating autoregressive language model inference without auxiliary models or specialized inference pipelines. Using multi-token sequences generated by the model itself as supervision, the approach trains the model to predict multiple tokens in parallel in a single forward pass, without altering the architecture or introducing additional components. On reasoning benchmarks such as GSM8K, the method achieves an average speedup exceeding 3× with less than a 5% drop in accuracy, while preserving the original model structure and deployment simplicity.
📝 Abstract
Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach that converts a pretrained autoregressive language model from a slow single-next-token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains exactly the same implementation as the pretrained initial checkpoint and is deployable without any auxiliary verifier or other specialized inference code. On GSM8K, our method produces models that decode more than $3\times$ faster on average with a $<5\%$ drop in accuracy relative to single-token decoding performance.
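The self-distillation objective described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the "model" is a trivial bigram table standing in for a pretrained LM, and all names (`teacher_rollout`, `student_parallel_probs`, `distill_loss`, `k`) are illustrative assumptions. The key idea it shows is that the slow autoregressive rollout supplies the targets, while the loss scores `k` per-position predictions produced in one (simulated) parallel pass.

```python
import math

VOCAB = ["a", "b", "c"]

def next_token_probs(context):
    """Stand-in for one forward pass of the base model: a distribution
    over the next token given the context (deterministic toy bigram
    that cycles a -> b -> c -> a)."""
    last = context[-1] if context else "a"
    target = VOCAB[(VOCAB.index(last) + 1) % len(VOCAB)]
    return {t: 0.9 if t == target else 0.05 for t in VOCAB}

def teacher_rollout(context, k):
    """Teacher signal: the model's own greedy k-token autoregressive
    continuation (one forward pass per token -- the slow path)."""
    ctx, out = list(context), []
    for _ in range(k):
        tok = max(next_token_probs(ctx), key=next_token_probs(ctx).get)
        out.append(tok)
        ctx.append(tok)
    return out

def student_parallel_probs(context, k):
    """Student signal: k per-position distributions from a single
    (simulated) forward pass. Here parallelism is faked by reusing the
    same head k times; a real model would emit all k positions at once."""
    return [next_token_probs(context) for _ in range(k)]

def distill_loss(context, k):
    """Mean cross-entropy of the parallel predictions against the
    model's own rollout -- the online self-distillation objective."""
    targets = teacher_rollout(context, k)
    preds = student_parallel_probs(context, k)
    return -sum(math.log(p[t]) for p, t in zip(preds, targets)) / k
```

Minimizing `distill_loss` pushes the parallel heads toward the model's own sequential outputs, so at inference time `k` tokens can be emitted per forward pass instead of one.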