Next-Token Prediction and Regret Minimization

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates when next-token prediction models can achieve low adversarial regret in online decision-making against adaptive adversaries. For a model trained on a distribution over sequences of opponent actions, it asks whether approximately best responding to the model's predictions yields low regret, distinguishing between settings with unbounded and bounded context windows. Theoretically, with unbounded contexts, every distribution is exponentially close (in total variation distance) to a low-regret distribution, so sublinear regret can always be achieved at negligible cost to prediction accuracy; with bounded context windows, however, some distributions are constant-far from every low-regret distribution. To address this limitation, the paper gives a robustification procedure implementable by standard Transformer layers, and empirical results demonstrate that Transformers can efficiently learn the resulting low-regret policies.
📝 Abstract
We consider the question of how to employ next-token prediction algorithms in adversarial online decision-making environments. Specifically, if we train a next-token prediction model on a distribution $\mathcal{D}$ over sequences of opponent actions, when is it the case that the induced online decision-making algorithm (by approximately best responding to the model's predictions) has low adversarial regret (i.e., when is $\mathcal{D}$ a \emph{low-regret distribution})? For unbounded context windows (where the prediction made by the model can depend on all the actions taken by the adversary thus far), we show that although not every distribution $\mathcal{D}$ is a low-regret distribution, every distribution $\mathcal{D}$ is exponentially close (in TV distance) to one low-regret distribution, and hence sublinear regret can always be achieved at negligible cost to the accuracy of the original next-token prediction model. In contrast to this, for bounded context windows (where the prediction made by the model can depend only on the past $w$ actions taken by the adversary, as may be the case in modern transformer architectures), we show that there are some distributions $\mathcal{D}$ of opponent play that are $\Theta(1)$-far from any low-regret distribution $\mathcal{D}'$ (even when $w = \Omega(T)$ and such distributions exist). Finally, we complement these results by showing that the unbounded context robustification procedure can be implemented by layers of a standard transformer architecture, and provide empirical evidence that transformer models can be efficiently trained to represent these new low-regret distributions.
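The setup in the abstract can be illustrated with a toy, self-contained sketch (this is not the paper's construction): a frequency-count predictor over the opponent's last $w$ actions stands in for the next-token model, the learner approximately best responds to its prediction each round, and external regret is measured against the best fixed action in hindsight. The game, predictor, and opponent below are all illustrative assumptions.

```python
import random

ACTIONS = (0, 1)

def payoff(mine, theirs):
    # Matching-pennies payoff for the learner: win 1 if actions match.
    return 1.0 if mine == theirs else 0.0

def predict_next(history, w):
    """Toy stand-in for a next-token model with context window w:
    the empirical frequency of the opponent's last w actions
    (uniform when there is no history yet)."""
    ctx = history[-w:]
    if not ctx:
        return {a: 1.0 / len(ACTIONS) for a in ACTIONS}
    return {a: ctx.count(a) / len(ctx) for a in ACTIONS}

def best_response(pred):
    # Best respond to the predicted distribution over opponent actions.
    return max(ACTIONS,
               key=lambda m: sum(p * payoff(m, o) for o, p in pred.items()))

def run(T=1000, w=5, seed=0):
    """Play T rounds against an i.i.d. uniform opponent and return
    external regret: best fixed action's payoff minus realized payoff."""
    rng = random.Random(seed)
    opp_history, total = [], 0.0
    for _ in range(T):
        mine = best_response(predict_next(opp_history, w))
        theirs = rng.choice(ACTIONS)  # illustrative (non-adversarial) opponent
        total += payoff(mine, theirs)
        opp_history.append(theirs)
    best_fixed = max(sum(payoff(a, o) for o in opp_history) for a in ACTIONS)
    return best_fixed - total

if __name__ == "__main__":
    print(run())
```

Against an adaptive adversary rather than this i.i.d. opponent, such a bounded-window predictor can be exploited, which is the regime where the paper's distinction between unbounded and bounded context windows becomes relevant.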
Problem

Research questions and friction points this paper is trying to address.

next-token prediction
adversarial regret
online decision-making
low-regret distribution
context window
Innovation

Methods, ideas, or system contributions that make the work stand out.

next-token prediction
regret minimization
low-regret distribution
transformer architecture
adversarial online learning