A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies the online learnability of the input-output maps induced by autoregressive generation, focusing on how the optimal mistake bound for the final output depends on the generation horizon $M$ and whether feedback on the intermediate trajectory can reduce that dependence. Two online settings are formulated: End-to-End (the learner observes only the final output) and Chain-of-Thought (the learner observes the full generation trajectory). The main contributions are threefold: in the End-to-End setting, essentially any growth rate of the mistake bound in $M$ between constant and logarithmic can arise, and the logarithmic ceiling is unavoidable; in the Chain-of-Thought setting, observing the full trajectory eliminates the dependence on $M$ altogether; and for autoregressive linear threshold classes, the paper proves optimal mistake bounds together with a new lower bound for the statistical setting, resolving several questions left open by [Joshi et al., 2025].

📝 Abstract

Autoregressive generation lies at the heart of the mechanism of large language models. It can be viewed as the repeated application of a next-token generator: starting from an input string (prompt), the generator is applied for $M$ steps, and the last generated token is taken as the final output. [Joshi et al., 2025] proposed a PAC model for studying the learnability of the input-output maps arising from this process. We develop an online analogue of this framework, focusing on the mistake bound of learning the final output induced by an unknown next-token generator. We distinguish between two forms of feedback. In the End-to-End model, after each round the learner observes only the final token produced after $M$ autoregressive steps. In the Chain-of-Thought model, the learner is additionally shown the entire $M$-step trajectory. Our goal is to understand how the optimal mistake bound depends on the generation horizon $M$, and to what extent observing intermediate tokens can reduce this dependence. Our main results show that the online theory of autoregressive learning exhibits a qualitative picture analogous to the statistical one found by [Hanneke et al., 2026], but with a different scale of dependence on the generation horizon. In the End-to-End model, we prove a taxonomy of possible mistake-bound growth rates in the generation horizon $M$: essentially any rate between constant and logarithmic can arise. We further show that this logarithmic ceiling is unavoidable. In the Chain-of-Thought model, we show that access to the full generated trajectory eliminates the dependence on $M$ altogether. We also analyze autoregressive linear threshold classes, and prove optimal mistake bounds, as well as a new lower bound for the statistical setting. Along the way, our results resolve several questions left open by [Joshi et al., 2025].
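
To make the two feedback models concrete, the following is a minimal Python sketch of the generation process described in the abstract. It is an illustration under stated assumptions, not the paper's formal definition: tokens are modelled as integers, and the names `rollout`, `end_to_end_feedback`, `chain_of_thought_feedback`, and the toy generator `xor_last_two` are all hypothetical.

```python
from typing import Callable, List, Sequence

Token = int  # assumption: tokens are plain integers
Generator = Callable[[Sequence[Token]], Token]  # maps a string to its next token


def rollout(f: Generator, prompt: Sequence[Token], M: int) -> List[Token]:
    """Apply the next-token generator f for M steps and return the M generated tokens."""
    if M < 1:
        raise ValueError("the generation horizon M must be at least 1")
    seq = list(prompt)
    for _ in range(M):
        seq.append(f(seq))  # each step sees the prompt plus all tokens generated so far
    return seq[len(prompt):]


def end_to_end_feedback(f: Generator, prompt: Sequence[Token], M: int) -> Token:
    """End-to-End model: the learner is shown only the final generated token."""
    return rollout(f, prompt, M)[-1]


def chain_of_thought_feedback(f: Generator, prompt: Sequence[Token], M: int) -> List[Token]:
    """Chain-of-Thought model: the learner is shown the entire M-step trajectory."""
    return rollout(f, prompt, M)


def xor_last_two(seq: Sequence[Token]) -> Token:
    """Toy generator (hypothetical): XOR of the last two tokens of the string."""
    return seq[-1] ^ seq[-2]


if __name__ == "__main__":
    prompt = [1, 0]
    print(chain_of_thought_feedback(xor_last_two, prompt, 4))  # full trajectory
    print(end_to_end_feedback(xor_last_two, prompt, 4))        # final token only
```

In both models the learner is judged on the same quantity, the final token, so the number of mistakes is counted identically; the two models differ only in how much of the trajectory is revealed after each round.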

Problem

Research questions and friction points this paper is trying to address.

online learning

autoregressive generation

mistake bound

chain-of-thought

generation horizon

Innovation

Methods, ideas, or system contributions that make the work stand out.

online learning

autoregressive generation

chain-of-thought reasoning