🤖 AI Summary
This paper investigates the learning theory of time-invariant chains-of-thought (CoT), constructed iteratively by a fixed autoregressive generator, under two settings: explicit supervision (where full CoT sequences are observable) and implicit learning (where only prompt-answer pairs are given and the CoT is a latent variable).
Method: We establish, for the first time, chain-length-independent generalization bounds using VC-dimension analysis and linear-threshold modeling, deriving sample and computational complexity upper bounds; we further show that attention arises naturally in the construction, and we build a base class that simultaneously ensures universal representability and computational efficiency.
Contributions/Results: (1) The first generalization bound for time-invariant CoT with sample complexity independent of chain length; (2) A unified theoretical explanation for both CoT learnability and the emergence of attention; (3) A theoretically consistent framework bridging explicit supervision and implicit learning.
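The core construction summarized above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation; the names `generate_cot` and the toy parity generator below are hypothetical.

```python
def generate_cot(generator, prompt, num_steps):
    """Iterate a fixed sequence-to-next-token generator to build a chain-of-thought.

    The same `generator` is applied at every step (time invariance), which is
    what allows sample complexity to be independent of `num_steps`.
    """
    sequence = list(prompt)
    for _ in range(num_steps):
        next_token = generator(sequence)  # one fixed map, reused at every step
        sequence.append(next_token)
    answer = sequence[-1]  # the final token is taken as the answer
    return sequence, answer


# Toy usage with a parity-style generator (illustrative only):
cot, answer = generate_cot(lambda seq: sum(seq) % 2, [1, 0, 1], num_steps=3)
```

In the explicit-supervision setting the learner sees the full `cot` sequence; in the implicit setting it sees only the `(prompt, answer)` pair, with the intermediate tokens latent.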
📝 Abstract
For a given base class of sequence-to-next-token generators, we consider learning prompt-to-answer mappings obtained by iterating a fixed, time-invariant generator for multiple steps, thus generating a chain-of-thought, and then taking the final token as the answer. We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs, with the chain-of-thought latent. We analyze the sample and computational complexity both in terms of general properties of the base class (e.g. its VC dimension) and for specific base classes such as linear thresholds. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning. Central to our development is that time invariance allows for sample complexity that is independent of the length of the chain-of-thought. Attention arises naturally in our construction.
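To make the "base class of sequence-to-next-token generators" concrete, here is a minimal sketch of one member of a linear-threshold base class, as mentioned in the abstract. The sliding-window feature map and the specific weights are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def linear_threshold_generator(weights):
    """Return a time-invariant generator: next token = 1 iff <w, features(seq)> >= 0.

    Features here are simply the last len(weights) tokens, zero-padded:
    a stand-in for whatever embedding the base class actually uses.
    """
    k = len(weights)

    def generator(seq):
        window = ([0] * k + list(seq))[-k:]  # last k tokens, zero-padded on the left
        score = float(np.dot(weights, window))
        return 1 if score >= 0 else 0

    return generator


# Toy usage: a generator over binary tokens with a window of 3 (illustrative only):
gen = linear_threshold_generator(np.array([1.0, -2.0, 1.0]))
```

Because the same `weights` are applied at every step of the chain, only one threshold function is learned regardless of chain length, which is the time-invariance property the sample-complexity bounds exploit.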