AI Summary
Multi-token prediction has proven effective in pretraining but generalizes poorly to downstream fine-tuning. To address this, we propose MuToR, a method that enables multi-step token prediction by interleaving learnable register tokens into the input sequence, requiring no architectural modifications and remaining fully compatible with off-the-shelf pretrained language models and the standard next-token prediction objective. Its key contributions are: (i) a register-token interleaving mechanism that enables flexible and scalable prediction horizons; (ii) zero architectural changes and negligible parameter overhead (<0.1%); and (iii) native compatibility with both supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT). Extensive experiments show that MuToR consistently improves generation quality across the pretraining, SFT, and PEFT stages on both language modeling and cross-modal tasks (e.g., visual generation), achieving state-of-the-art performance on multiple benchmarks.
Abstract
Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes (ensuring compatibility with off-the-shelf pretrained language models), and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
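To make the interleaving idea concrete, the sketch below shows one way the input/target construction could look. This is a minimal illustration under stated assumptions, not the paper's released implementation: the register placement (one register after every token), the prediction offset `depth`, and the names `REG` and `IGNORE` are all assumptions for illustration. Ordinary positions keep the standard next-token target, while each register token is assigned a target `depth` tokens ahead, which is how a scalable prediction horizon could be exposed without architectural changes.

```python
# Hypothetical sketch of register-token interleaving for multi-token prediction.
# Assumptions (not from the paper's code): a register is inserted after every
# input token, and each register predicts the token `depth` steps ahead.

REG = -1       # placeholder id standing in for the learnable register token
IGNORE = -100  # label ignored by the loss (a common convention in LM training)

def interleave_registers(tokens, depth=2):
    """Return (inputs, targets) with registers interleaved into `tokens`.

    Ordinary positions get the usual next-token target; each register gets
    a target `depth` tokens into the future (or IGNORE past the sequence end).
    """
    inputs, targets = [], []
    for i, tok in enumerate(tokens):
        inputs.append(tok)
        targets.append(tokens[i + 1] if i + 1 < len(tokens) else IGNORE)
        inputs.append(REG)
        future = i + depth
        targets.append(tokens[future] if future < len(tokens) else IGNORE)
    return inputs, targets

inputs, targets = interleave_registers([10, 11, 12, 13], depth=2)
# inputs  -> [10, REG, 11, REG, 12, REG, 13, REG]
# targets -> [11, 12, 12, 13, 13, IGNORE, IGNORE, IGNORE]
```

Because registers only add extra supervised positions (with targets that are otherwise ignored), the objective at ordinary positions is unchanged next-token prediction, which is consistent with the compatibility claims in the abstract.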