Multi-Token Prediction Needs Registers

πŸ“… 2025-05-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Multi-token prediction proves effective in pretraining but has not generalized well to downstream fine-tuning. To address this, we propose MuToR, a method that enables multi-step token prediction by interleaving learnable register tokens into the input sequence, requiring no architectural modifications and remaining fully compatible with off-the-shelf pretrained language models and the standard next-token prediction objective. Its key contributions are: (i) a register-token interleaving mechanism that enables flexible and scalable prediction horizons; (ii) zero architectural changes and negligible parameter overhead (<0.1%); and (iii) native compatibility with both supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT). Extensive experiments demonstrate that MuToR consistently enhances generation quality across pretraining, SFT, and PEFT on both language modeling and cross-modal tasks (e.g., visual generation), achieving strong performance on multiple benchmarks.

πŸ“ Abstract
Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes (ensuring compatibility with off-the-shelf pretrained language models), and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
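The interleaving scheme described in the abstract can be sketched in a few lines. The register-token id, the spacing `k`, and the lookahead offset `d` below are illustrative assumptions, not the paper's actual configuration; the point is only that regular positions keep their next-token labels while inserted register positions receive targets further ahead, so training stays compatible with a standard next-token loss.

```python
# Hypothetical sketch of MuToR-style register-token interleaving.
# REG_ID, the spacing k, and the offset d are illustrative assumptions;
# the paper's actual token ids and hyperparameters may differ.
REG_ID = -1      # placeholder id for the learnable register token
IGNORE = -100    # label value excluded from the loss (common convention)

def interleave_registers(tokens, k=2, d=2):
    """Insert a register token after every k input tokens and build labels.

    Regular positions keep the standard next-token target; each register
    position is instead assigned a target d steps ahead, so the model is
    trained on multi-token prediction without architectural changes.
    """
    inputs, labels = [], []
    for i, tok in enumerate(tokens):
        inputs.append(tok)
        # Next-token target for the regular position.
        labels.append(tokens[i + 1] if i + 1 < len(tokens) else IGNORE)
        if (i + 1) % k == 0:
            inputs.append(REG_ID)
            # Register token predicts a target d steps ahead.
            labels.append(tokens[i + d] if i + d < len(tokens) else IGNORE)
    return inputs, labels
```

For example, `interleave_registers([10, 11, 12, 13, 14])` yields inputs `[10, 11, -1, 12, 13, -1, 14]` with labels `[11, 12, 13, 13, 14, -100, -100]`: the first register token (after position 2) is asked for the token two steps ahead. At inference the register tokens would simply not be inserted, which is why the method stays compatible with standard autoregressive decoding.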
Problem

Research questions and friction points this paper is trying to address.

Improving multi-token prediction in language models
Enhancing fine-tuning compatibility with pretrained models
Supporting scalable prediction horizons in generative tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaves learnable register tokens for prediction
Requires no architectural changes to models
Supports scalable prediction horizons naturally