🤖 AI Summary
Despite sufficient model capacity, chain-of-thought (CoT) prompting significantly improves sample efficiency in Transformers—yet the underlying mechanism remains unclear. Method: We analyze CoT through parity learning theory, synthetic data experiments, real-task validation, and quantitative/visual analysis of attention sparsity. Contribution/Results: We theoretically establish—via sample complexity analysis—that CoT reduces the required sample size from exponential to polynomial by inducing sparse temporal dependencies among input tokens, thereby driving self-attention to become intrinsically sparse and interpretable. Crucially, CoT’s benefit stems not from enhanced representational capacity, but from structured dependency modeling. Empirically, CoT-trained models exhibit markedly sparser and more interpretable attention patterns, yielding 35–62% average gains in cross-task generalization performance.
📝 Abstract
Chain-of-thought (CoT) significantly enhances the reasoning performance of large language models (LLM). While current theoretical studies often attribute this improvement to increased expressiveness and computational capacity, we argue that expressiveness is not the primary limitation in the LLM regime, as current large models will fail on simple tasks. Using a parity-learning setup, we demonstrate that CoT can substantially improve sample efficiency even when the representation power is sufficient. Specifically, with CoT, a transformer can learn the function within polynomial samples, whereas without CoT, the required sample size is exponential. Additionally, we show that CoT simplifies the learning process by introducing sparse sequential dependencies among input tokens, and leads to a sparse and interpretable attention. We validate our theoretical analysis with both synthetic and real-world experiments, confirming that sparsity in attention layers is a key factor of the improvement induced by CoT.