From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency

📅 2024-10-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Despite sufficient model capacity, chain-of-thought (CoT) prompting significantly improves sample efficiency in Transformers—yet the underlying mechanism remains unclear. Method: We analyze CoT through parity learning theory, synthetic data experiments, real-task validation, and quantitative/visual analysis of attention sparsity. Contribution/Results: We theoretically establish—via sample complexity analysis—that CoT reduces the required sample size from exponential to polynomial by inducing sparse temporal dependencies among input tokens, thereby driving self-attention to become intrinsically sparse and interpretable. Crucially, CoT’s benefit stems not from enhanced representational capacity, but from structured dependency modeling. Empirically, CoT-trained models exhibit markedly sparser and more interpretable attention patterns, yielding 35–62% average gains in cross-task generalization performance.

Technology Category

Application Category

📝 Abstract
Chain-of-thought (CoT) significantly enhances the reasoning performance of large language models (LLM). While current theoretical studies often attribute this improvement to increased expressiveness and computational capacity, we argue that expressiveness is not the primary limitation in the LLM regime, as current large models will fail on simple tasks. Using a parity-learning setup, we demonstrate that CoT can substantially improve sample efficiency even when the representation power is sufficient. Specifically, with CoT, a transformer can learn the function within polynomial samples, whereas without CoT, the required sample size is exponential. Additionally, we show that CoT simplifies the learning process by introducing sparse sequential dependencies among input tokens, and leads to a sparse and interpretable attention. We validate our theoretical analysis with both synthetic and real-world experiments, confirming that sparsity in attention layers is a key factor of the improvement induced by CoT.
Problem

Research questions and friction points this paper is trying to address.

Chain-of-thought improves transformer sample efficiency.
CoT enables polynomial sample learning versus exponential.
Sparse attention enhances interpretability and learning efficiency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

CoT enhances transformer sample efficiency
CoT introduces sparse sequential dependencies
Sparsity in attention layers improves performance
🔎 Similar Papers
No similar papers found.
Kaiyue Wen
Kaiyue Wen
Phd Student, Stanford University
Machine LearningNatural Language Processing
H
Huaqing Zhang
IIIS, Tsinghua University
Hongzhou Lin
Hongzhou Lin
Amazon
Artificial IntelligenceLLMOptimizationTheory of Deep Learning
J
Jingzhao Zhang
IIIS, Tsinghua University, Shanghai AI Lab, Shanghai Qizhi Institute