A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning

📅 2026-05-07

🤖 AI Summary
This work studies the online learnability of the input-output maps induced by autoregressive generation, focusing on how the optimal mistake bound depends on the generation horizon $M$ and whether feedback on the intermediate trajectory can reduce that dependence. It formulates two online settings, End-to-End (the learner observes only the final token) and Chain-of-Thought (the learner observes the full $M$-step trajectory), as an online analogue of the PAC framework of [Joshi et al., 2025]. The main results are threefold: in the End-to-End setting, essentially any growth rate of the mistake bound between constant and logarithmic in $M$ can arise, and the logarithmic ceiling is unavoidable; in the Chain-of-Thought setting, access to the full trajectory eliminates the dependence on $M$ altogether; and for autoregressive linear threshold classes, the paper proves optimal mistake bounds and a new lower bound for the statistical setting. Along the way, the results resolve several questions left open by [Joshi et al., 2025].
📝 Abstract
Autoregressive generation lies at the heart of the mechanism of large language models. It can be viewed as the repeated application of a next-token generator: starting from an input string (prompt), the generator is applied for $M$ steps, and the last generated token is taken as the final output. [Joshi et al., 2025] proposed a PAC model for studying the learnability of the input-output maps arising from this process. We develop an online analogue of this framework, focusing on the mistake bound of learning the final output induced by an unknown next-token generator. We distinguish between two forms of feedback. In the End-to-End model, after each round the learner observes only the final token produced after $M$ autoregressive steps. In the Chain-of-Thought model, the learner is additionally shown the entire $M$-step trajectory. Our goal is to understand how the optimal mistake bound depends on the generation horizon $M$, and to what extent observing intermediate tokens can reduce this dependence. Our main results show that the online theory of autoregressive learning exhibits a qualitative picture analogous to the statistical one found by [Hanneke et al., 2026], but with a different scale of dependence on the generation horizon. In the End-to-End model, we prove a taxonomy of possible mistake-bound growth rates in the generation horizon $M$: essentially any rate between constant and logarithmic can arise. We further show that this logarithmic ceiling is unavoidable. In the Chain-of-Thought model, we show that access to the full generated trajectory eliminates the dependence on $M$ altogether. We also analyze autoregressive linear threshold classes, and prove optimal mistake bounds, as well as a new lower bound for the statistical setting. Along the way, our results resolve several questions left open by [Joshi et al., 2025].
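The two feedback models described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the integer token type, the function names, and the toy generator `xor_of_last_two` are all assumptions made for illustration. The sketch only shows what the learner gets to see in each model after the generator has been applied for $M$ steps.

```python
from typing import Callable, List, Sequence

Token = int
# A next-token generator maps the current context to one new token.
Generator = Callable[[Sequence[Token]], Token]


def rollout(generator: Generator, prompt: Sequence[Token], horizon: int) -> List[Token]:
    """Apply the generator `horizon` times, appending each new token to the context.

    Returns only the generated tokens (the M-step trajectory), not the prompt.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    context = list(prompt)
    trajectory: List[Token] = []
    for _ in range(horizon):
        token = generator(context)
        context.append(token)
        trajectory.append(token)
    return trajectory


def end_to_end_feedback(generator: Generator, prompt: Sequence[Token], horizon: int) -> Token:
    """End-to-End model: the learner sees only the last generated token."""
    return rollout(generator, prompt, horizon)[-1]


def chain_of_thought_feedback(
    generator: Generator, prompt: Sequence[Token], horizon: int
) -> List[Token]:
    """Chain-of-Thought model: the learner sees the whole M-step trajectory."""
    return rollout(generator, prompt, horizon)


def xor_of_last_two(context: Sequence[Token]) -> Token:
    """Hypothetical toy generator: XOR of the last two binary tokens in the context."""
    return (context[-1] + context[-2]) % 2


if __name__ == "__main__":
    prompt = [1, 0]
    print(chain_of_thought_feedback(xor_of_last_two, prompt, 4))  # full trajectory
    print(end_to_end_feedback(xor_of_last_two, prompt, 4))        # final token only
```

In both models the quantity being predicted is the same final token; the difference is only in how much of the trajectory is revealed after each round, which is the distinction the mistake-bound results turn on.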
Problem

Research questions and friction points this paper is trying to address.

online learning
autoregressive generation
mistake bound
chain-of-thought
generation horizon
Innovation

Methods, ideas, or system contributions that make the work stand out.

online learning
autoregressive generation
chain-of-thought reasoning
mistake bound
PAC learning