🤖 AI Summary
This study examines how the sample complexity of learning the input-output mapping of an autoregressive language model scales with the generation length $T$ under different forms of supervision. Working in the PAC-learning framework for next-token generators introduced by Joshi et al. (COLT 2025), the authors compare End-to-End supervision, where only the final output is revealed, with Chain-of-Thought (CoT) supervision, where the entire generated chain is revealed. Under End-to-End supervision, subject to mild conditions, essentially any growth rate between constant and linear in $T$ can arise as the sample complexity; under CoT supervision, the sample complexity is independent of $T$. Combined with the linear upper bound of Joshi et al., these results give a nearly complete characterization of how sample complexity depends on generation length and resolve several open questions posed by Joshi et al.
📝 Abstract
Modern large language models generate text autoregressively, producing tokens one at a time. To study the learnability of such systems, Joshi et al. (COLT 2025) introduced a PAC-learning framework for next-token generators, the primitive underlying autoregressive models. In this framework, an unknown next-token generator maps a sequence of tokens to the next token and is iteratively applied for $T$ steps, producing a chain of tokens whose final token constitutes the model's output. The learning task is to learn the input-output mapping induced by this autoregressive process. Depending on the available supervision, training examples may reveal only the final output (End-to-End supervision) or the entire generated chain (Chain-of-Thought supervision). This raises two natural questions: how the sample complexity depends on the generation length $T$, and how much Chain-of-Thought supervision can reduce this dependence.
In this work we give a nearly complete answer to both questions by uncovering a taxonomy of how the sample complexity scales with $T$. For End-to-End learning, we show that the landscape is remarkably rich: subject to mild conditions, essentially any growth rate $r(T)$ between constant and linear can arise as the sample complexity, and combined with the linear upper bound of Joshi et al., this yields an essentially complete characterization. In contrast, under Chain-of-Thought supervision we show that the sample complexity is independent of $T$, demonstrating that access to intermediate reasoning steps can eliminate the dependence on the generation length altogether. Our analysis introduces new combinatorial tools, and as corollaries we resolve several open questions posed by Joshi et al. regarding the dependence of learnability on the generation length and the role of Chain-of-Thought supervision.
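To make the setup concrete, the following is a minimal Python sketch of the autoregressive process and the two supervision modes described in the abstract. It is an illustration only, not the paper's formal definition: the function names (`generate_chain`, `end_to_end_example`, `chain_of_thought_example`), the integer token type, and the toy generator `xor_last_two` are all hypothetical choices made for this example.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# A next-token generator maps the sequence so far to the next token.
NextTokenGenerator = Callable[[Sequence[Token]], Token]


def generate_chain(f: NextTokenGenerator, prompt: Sequence[Token], T: int) -> List[Token]:
    """Apply f iteratively for T steps and return the T generated tokens."""
    if T < 1:
        raise ValueError("T must be at least 1")
    seq = list(prompt)
    for _ in range(T):
        seq.append(f(seq))  # each new token depends on everything before it
    return seq[len(prompt):]


def end_to_end_example(
    f: NextTokenGenerator, prompt: Sequence[Token], T: int
) -> Tuple[List[Token], Token]:
    """End-to-End supervision: the learner sees only the input and the final token."""
    chain = generate_chain(f, prompt, T)
    return list(prompt), chain[-1]


def chain_of_thought_example(
    f: NextTokenGenerator, prompt: Sequence[Token], T: int
) -> Tuple[List[Token], List[Token]]:
    """Chain-of-Thought supervision: the learner sees the input and the whole chain."""
    return list(prompt), generate_chain(f, prompt, T)


def xor_last_two(seq: Sequence[Token]) -> Token:
    """Toy generator (an assumption for illustration): XOR of the last two tokens."""
    return seq[-1] ^ seq[-2]


if __name__ == "__main__":
    prompt, T = [1, 0], 4
    print(end_to_end_example(xor_last_two, prompt, T))        # input and final token
    print(chain_of_thought_example(xor_last_two, prompt, T))  # input and full chain
```

In this sketch both kinds of example come from the same unknown generator; they differ only in how much of the generated chain is revealed to the learner, which is the distinction the sample-complexity results are about.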