Multi-Token Prediction Needs Registers

πŸ“… 2025-05-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Multi-token prediction proves effective in pretraining but has not generalized well to downstream fine-tuning. To address this, we propose MuToR, a method that enables multi-step token prediction by interleaving learnable register tokens into the input sequence, requiring no architectural modifications and remaining fully compatible with off-the-shelf pretrained language models and the standard next-token prediction objective. Its key contributions are: (i) a register-token interleaving mechanism that enables flexible and scalable prediction horizons; (ii) zero architectural changes and negligible parameter overhead (<0.1%); and (iii) native compatibility with both supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT). Extensive experiments demonstrate that MuToR consistently enhances generation quality across pretraining, SFT, and PEFT on both language modeling and cross-modal tasks (e.g., visual generation), achieving strong performance on multiple benchmarks.

πŸ“ Abstract
Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes (ensuring compatibility with off-the-shelf pretrained language models), and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
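The interleaving scheme described in the abstract can be sketched in a few lines. The register-token id, the spacing `k`, and the lookahead offset `d` below are illustrative assumptions, not the paper's actual configuration; the point is only that regular positions keep their next-token labels while inserted register positions receive targets further ahead, so training stays compatible with a standard next-token loss.

```python
# Hypothetical sketch of MuToR-style register-token interleaving.
# REG_ID, the spacing k, and the offset d are illustrative assumptions;
# the paper's actual token ids and hyperparameters may differ.
REG_ID = -1      # placeholder id for the learnable register token
IGNORE = -100    # label value excluded from the loss (common convention)

def interleave_registers(tokens, k=2, d=2):
    """Insert a register token after every k input tokens and build labels.

    Regular positions keep the standard next-token target; each register
    position is instead assigned a target d steps ahead, so the model is
    trained on multi-token prediction without architectural changes.
    """
    inputs, labels = [], []
    for i, tok in enumerate(tokens):
        inputs.append(tok)
        # Next-token target for the regular position.
        labels.append(tokens[i + 1] if i + 1 < len(tokens) else IGNORE)
        if (i + 1) % k == 0:
            inputs.append(REG_ID)
            # Register token predicts a target d steps ahead.
            labels.append(tokens[i + d] if i + d < len(tokens) else IGNORE)
    return inputs, labels
```

For example, `interleave_registers([10, 11, 12, 13, 14])` yields inputs `[10, 11, -1, 12, 13, -1, 14]` with labels `[11, 12, 13, 13, 14, -100, -100]`: the first register token (after position 2) is asked for the token two steps ahead. At inference the register tokens would simply not be inserted, which is why the method stays compatible with standard autoregressive decoding.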
Problem

Research questions and friction points this paper is trying to address.

Improving multi-token prediction in language models
Enhancing fine-tuning compatibility with pretrained models
Supporting scalable prediction horizons in generative tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaves learnable register tokens for prediction
Requires no architectural changes to models
Supports scalable prediction horizons naturally