The pitfalls of next-token prediction

📅 2024-03-11
🏛️ International Conference on Machine Learning
📈 Citations: 89
Influential: 11
🤖 AI Summary
This work challenges the sufficiency of next-token prediction for modeling human-like intelligence in large language models (LLMs), identifying a fundamental limitation of teacher-forced training on structurally sensitive planning tasks: even setting aside autoregressive error accumulation, models can fail to learn an accurate next-token predictor in the first place. The paper systematically analyzes the failure mechanism intrinsic to teacher forcing under such task conditions and proposes multi-token prediction as a more robust alternative training objective. Empirical evaluation on a minimal planning task, using both Transformer and Mamba architectures, shows that standard teacher forcing fails catastrophically despite the task being straightforward to learn, whereas predicting multiple tokens in advance restores strong performance. The study thus offers an empirically grounded perspective on LLM training paradigms and a concrete pathway for improving sequence modeling in structured reasoning tasks.
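To make the contrast concrete, the sketch below shows how training pairs differ between standard teacher forcing (each position predicts a single next token) and a multi-token objective (each position predicts the next k tokens at once). This is a minimal illustration of the idea described in the summary, not the paper's implementation; the function names and toy token sequence are assumptions.

```python
# Minimal sketch: training-pair construction under teacher forcing
# vs. a multi-token objective. Toy token ids are illustrative.

def next_token_pairs(seq):
    """Teacher forcing: at position t the model conditions on the
    ground-truth prefix seq[:t+1] and predicts the single token seq[t+1]."""
    return [(seq[: t + 1], seq[t + 1]) for t in range(len(seq) - 1)]

def multi_token_pairs(seq, k):
    """Multi-token objective: at position t the model predicts the next
    k tokens seq[t+1 : t+1+k] in one shot, so tokens t+2 .. t+k cannot
    lean on a revealed ground-truth prefix."""
    return [
        (seq[: t + 1], seq[t + 1 : t + 1 + k])
        for t in range(len(seq) - k)
    ]

toy = [3, 1, 4, 1, 5, 9]
print(next_token_pairs(toy)[0])      # ([3], 1)
print(multi_token_pairs(toy, 3)[0])  # ([3], [1, 4, 1])
```

On planning tasks of the kind the paper constructs, the first formulation lets the model fit each step locally from the teacher-provided prefix, which is exactly the shortcut the multi-token formulation removes.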

📝 Abstract
Can a mere next-token predictor faithfully model human intelligence? We crystallize this emerging concern and correct popular misconceptions surrounding it, and advocate a simple multi-token objective. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference, crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner -- remarkably, despite the task being straightforward to learn. Finally, we provide preliminary evidence that this failure can be resolved using a simple modification that predicts multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available under https://github.com/gregorbachmann/Next-Token-Failures
Problem

Research questions and friction points this paper is trying to address.

Examines limitations of next-token prediction in modeling human intelligence
Identifies failures in teacher-forced training for accurate next-token prediction
Proposes multi-token objective to address next-token prediction shortcomings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advocates multi-token objective training
Exposes teacher-forcing failure mechanism
Proposes teacherless training with dummy tokens
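The "teacherless" variant listed above can be sketched as an input-construction change: instead of feeding the ground-truth answer tokens back as inputs during training, they are replaced with a fixed dummy token, so the model must produce the whole answer without conditioning on its revealed prefix. The dummy token id and toy prompt/answer values below are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of teacherless input construction with dummy tokens.

DUMMY = -1  # placeholder token id (assumption; the real vocab would reserve one)

def teacherless_inputs(prompt, answer):
    """Inputs: the prompt followed by dummy tokens, one per answer token.
    Targets: the true answer. The model never sees earlier ground-truth
    answer tokens as input context, unlike teacher forcing."""
    inputs = prompt + [DUMMY] * len(answer)
    targets = answer
    return inputs, targets

prompt, answer = [7, 8], [2, 5, 6]
inp, tgt = teacherless_inputs(prompt, answer)
print(inp)  # [7, 8, -1, -1, -1]
print(tgt)  # [2, 5, 6]
```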