Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification

πŸ“… 2025-02-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work investigates error amplification in autoregressive sequence modeling due to model misspecification: in next-token prediction, the approximation factor $C$ grows linearly with sequence length $H$, i.e., $C = \Theta(H)$, revealing a fundamental computational-statistical tradeoff. The authors prove an $\Omega(H)$ lower bound on $C$ for any next-token prediction-style objective. They further show that while an $O(1)$ approximation is achievable information-theoretically, no computationally efficient algorithm can attain a sub-polynomial approximation factor for autoregressive linear models. A unified analytical framework is developed, linking error amplification directly to generalization failure in imitation learning (behavior cloning). Finally, for binary token spaces, they construct a sub-exponential-time algorithm achieving sublinear $C$, and extend their results to general misspecified autoregressive modeling and suboptimal approximation settings.

πŸ“ Abstract
Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from error amplification, where errors in the model compound and generation quality degrades as sequence length $H$ increases. From a theoretical perspective, this phenomenon should not appear in well-specified settings, and, indeed, a growing body of empirical work hypothesizes that misspecification, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification -- where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C \geq 1$ -- we confirm that $C$ indeed grows with $H$ for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically, computationally, or information-theoretically, and uncover inherent computational-statistical tradeoffs. We show: (1) Information-theoretically, one can avoid error amplification and achieve $C = O(1)$. (2) Next-token prediction can be made robust so as to achieve $C = \tilde{O}(H)$, representing moderate error amplification, but this is an inherent barrier: any next-token prediction-style objective must suffer $C = \Omega(H)$. (3) For the natural testbed of autoregressive linear models, no computationally efficient algorithm can achieve sub-polynomial approximation factor $C = e^{(\log H)^{1-\Omega(1)}}$; however, at least for binary token spaces, one can smoothly trade compute for statistical power and improve on $C = \Omega(H)$ in sub-exponential time. Our results have consequences in the more general setting of imitation learning, where the widely-used behavior cloning algorithm generalizes next-token prediction.
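A toy numerical sketch (not from the paper) of the linear error-amplification phenomenon the abstract describes: if each of the $H$ next-token steps incurs a small per-step loss gap `eps` relative to the best-in-class model, a naive sequence-level bound that sums the per-step gaps scales as $H \cdot \mathrm{eps}$, matching the $C = \Theta(H)$ behavior. The function name and constants here are illustrative assumptions.

```python
def sequence_level_gap(eps: float, H: int) -> float:
    """Naive accumulation of a per-step gap `eps` over H tokens.

    Total variation distance between distributions is capped at 1,
    hence the min(). This is the pessimistic linear-in-H bound.
    """
    return min(1.0, H * eps)

# The same per-step error becomes much more damaging at longer horizons.
per_step_eps = 0.001
for H in (10, 100, 1000):
    print(H, sequence_level_gap(per_step_eps, H))
```

Running this shows the gap growing tenfold as the horizon grows tenfold, until it saturates: the point of the paper's results (1)-(3) is characterizing when this linear growth is avoidable and at what computational cost.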
Problem

Research questions and friction points this paper is trying to address.

Explores error amplification in next-token prediction models.
Investigates computational-statistical tradeoffs under model misspecification.
Assesses robustness and efficiency in autoregressive learning algorithms.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Avoids error amplification information-theoretically, achieving $C = O(1)$
Makes next-token prediction robust, limiting amplification to $C = \tilde{O}(H)$
Trades compute for statistical power, beating $C = \Omega(H)$ in sub-exponential time for binary token spaces