AI Summary
This study reveals that frozen large language models (LLMs) pretrained solely on next-token prediction (NTP) inherently possess multi-token prediction (MTP) capability, and systematically investigates the modeling bottlenecks and the pathways for exploiting MTP efficiently. To address the difficulty of adapting MTP heads to frozen LLMs, we propose a zero-finetuning parallel generation method based on numerical marginalization, along with a plug-and-play MTP head and a joint training paradigm. Key findings are: (1) the hidden layers of NTP models are highly specialized for single-step prediction, which constitutes the fundamental bottleneck for MTP adaptation; (2) pure marginalization-based MTP is feasible, but its generalization depends on the data distribution; (3) jointly training the MTP head with the backbone improves performance yet remains constrained by NTP-specific representations; (4) MTP capability scales significantly with model size. This work provides the first empirical validation of the intrinsic parallel generation potential of NTP-pretrained models and of its scaling behavior.
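The marginalization idea above can be made concrete: a two-steps-ahead distribution is obtained from an ordinary NTP model by summing over the intermediate token, p(x_{t+2} | ctx) = Σ_{x_{t+1}} p(x_{t+1} | ctx) · p(x_{t+2} | ctx, x_{t+1}). The paper does not publish its implementation, so the following is only a minimal sketch; `next_probs` stands in for any one-step next-token distribution (here a hypothetical toy Markov "LM"), and the `top_k` truncation is an assumed tractability shortcut, since summing over a full LLM vocabulary is expensive.

```python
import numpy as np

def two_step_marginal(next_probs, context, vocab_size, top_k=None):
    """Estimate p(x_{t+2} | context) by numerically marginalizing over x_{t+1}.

    next_probs(ctx) -> length-`vocab_size` probability vector p(next | ctx).
    With top_k set, only the k most likely intermediate tokens are summed
    over (an approximation), and the result is renormalized.
    """
    p1 = next_probs(context)  # p(x_{t+1} | context)
    candidates = (np.argsort(p1)[::-1][:top_k] if top_k
                  else np.arange(vocab_size))
    marginal = np.zeros(vocab_size)
    for tok in candidates:
        # weight each continuation's distribution by the intermediate token's probability
        marginal += p1[tok] * next_probs(context + [int(tok)])
    return marginal / marginal.sum()  # renormalize after any truncation

# Toy 3-token "LM" whose next-token distribution depends only on the last token,
# i.e. a Markov chain with transition matrix T (purely illustrative).
T = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
toy_lm = lambda ctx: T[ctx[-1]]

m = two_step_marginal(toy_lm, [0], vocab_size=3)
```

For this Markov toy, the exact two-step distribution from state 0 is `(T @ T)[0]`, so the full (untruncated) marginalization recovers it exactly; with a real LLM, `next_probs` would require one extra forward pass per candidate intermediate token, which is the cost the top-k truncation trades off.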
Abstract
We systematically investigate multi-token prediction (MTP) capabilities within LLMs pretrained for next-token prediction (NTP). We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities, though performance is data-dependent and improves with model scale. We then explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP, making adaptation non-trivial. Finally, we show that while joint training of MTP heads with the backbone improves performance, it cannot fully overcome this barrier, motivating further research in this direction. Our findings provide a deeper understanding of MTP applied to pretrained LLMs, informing strategies for accelerating inference through parallel token prediction.