🤖 AI Summary
This work proposes an online self-distillation method for accelerating autoregressive language model inference without auxiliary models or specialized inference pipelines. Using multi-token sequences generated by the model itself as supervision, the approach trains the model to predict multiple tokens in parallel in a single forward pass, without altering the architecture or introducing additional components. On reasoning benchmarks such as GSM8K, the method achieves an average speedup exceeding 3× with less than a 5% drop in accuracy, while preserving the original model structure and deployment simplicity.
📝 Abstract
Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach that converts a pretrained autoregressive language model from a slow single-next-token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains exactly the same implementation as the pretrained initial checkpoint and is deployable without any auxiliary verifier or other specialized inference code. On GSM8K, our method produces models that decode more than $3\times$ faster on average with a $<5\%$ drop in accuracy relative to single-token decoding performance.
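The self-distillation objective described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the "model" is a trivial bigram table standing in for a pretrained LM, and all names (`teacher_rollout`, `student_parallel_probs`, `distill_loss`, `k`) are illustrative assumptions. The key idea it shows is that the slow autoregressive rollout supplies the targets, while the loss scores `k` per-position predictions produced in one (simulated) parallel pass.

```python
import math

VOCAB = ["a", "b", "c"]

def next_token_probs(context):
    """Stand-in for one forward pass of the base model: a distribution
    over the next token given the context (deterministic toy bigram
    that cycles a -> b -> c -> a)."""
    last = context[-1] if context else "a"
    target = VOCAB[(VOCAB.index(last) + 1) % len(VOCAB)]
    return {t: 0.9 if t == target else 0.05 for t in VOCAB}

def teacher_rollout(context, k):
    """Teacher signal: the model's own greedy k-token autoregressive
    continuation (one forward pass per token -- the slow path)."""
    ctx, out = list(context), []
    for _ in range(k):
        tok = max(next_token_probs(ctx), key=next_token_probs(ctx).get)
        out.append(tok)
        ctx.append(tok)
    return out

def student_parallel_probs(context, k):
    """Student signal: k per-position distributions from a single
    (simulated) forward pass. Here parallelism is faked by reusing the
    same head k times; a real model would emit all k positions at once."""
    return [next_token_probs(context) for _ in range(k)]

def distill_loss(context, k):
    """Mean cross-entropy of the parallel predictions against the
    model's own rollout -- the online self-distillation objective."""
    targets = teacher_rollout(context, k)
    preds = student_parallel_probs(context, k)
    return -sum(math.log(p[t]) for p, t in zip(preds, targets)) / k
```

Minimizing `distill_loss` pushes the parallel heads toward the model's own sequential outputs, so at inference time `k` tokens can be emitted per forward pass instead of one.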