🤖 AI Summary
This paper addresses the fundamental question of how language models trained solely on next-token prediction can generate coherent, long-range structured text. The authors establish a theoretical guarantee within the RNN framework: minimizing next-token prediction loss yields a model that approximates the true data distribution with a polynomial-size architecture, achieving statistical indistinguishability from real text over token sequences of arbitrary length $k$. The result is derived by analyzing distribution approximation against distinguishers of bounded description length, and it yields an explicit polynomial upper bound on model size as a function of $k$, independent of the document length. The work provides the first complexity-theoretic explanation for how a simple autoregressive objective supports complex structural modeling, revealing the intrinsic capacity of next-token prediction to ensure long-range statistical consistency and structural coherence.
📝 Abstract
Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length limited to examining the next $k$ tokens, for any $k$, can distinguish between $k$ consecutive tokens of such documents and $k$ tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in $k$, independent of the document length) on the model size needed to achieve such $k$-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.
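The notion of $k$-token indistinguishability can be made concrete with a toy experiment (a minimal sketch, not the paper's construction: the "true" distribution, the "learned model", the `stay` parameters, and the distinguisher `d_stay` are all hypothetical choices for illustration). A bounded distinguisher sees $k$ consecutive tokens and outputs a bit; its advantage is the gap between its acceptance probability on real sequences and on model-generated ones. A well-fit model should drive this advantage near zero:

```python
import random

random.seed(0)

# Toy "true" distribution over tokens {0, 1}: a 2-state Markov chain
# where the next token repeats the previous one with probability `stay`.
def sample_seq(stay, k, start=0):
    seq = [start]
    for _ in range(k - 1):
        prev = seq[-1]
        seq.append(prev if random.random() < stay else 1 - prev)
    return seq

# Distinguishing advantage of a test D on k-token windows:
# |Pr[D(real) = 1] - Pr[D(model) = 1]|, estimated by sampling.
def advantage(distinguisher, stay_true, stay_model, k, trials=20000):
    hits_true = sum(distinguisher(sample_seq(stay_true, k)) for _ in range(trials))
    hits_model = sum(distinguisher(sample_seq(stay_model, k)) for _ in range(trials))
    return abs(hits_true - hits_model) / trials

# A simple (short-description) distinguisher: "most steps repeat the token".
def d_stay(seq):
    stays = sum(a == b for a, b in zip(seq, seq[1:]))
    return stays > len(seq) // 2

close = advantage(d_stay, 0.8, 0.79, k=16)  # nearly matched model
far = advantage(d_stay, 0.8, 0.50, k=16)    # poorly matched model
print(close, far)
```

In this sketch the nearly matched model yields a small advantage and the mismatched one a large advantage; the paper's guarantee is that, after optimizing next-token loss, no distinguisher of bounded description length examining $k$ tokens retains non-negligible advantage, for a model of size polynomial in $k$.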