From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency

📅 2024-10-07

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

159K/year

🤖 AI Summary

Despite sufficient model capacity, chain-of-thought (CoT) prompting significantly improves sample efficiency in Transformers—yet the underlying mechanism remains unclear. Method: We analyze CoT through parity learning theory, synthetic data experiments, real-task validation, and quantitative/visual analysis of attention sparsity. Contribution/Results: We theoretically establish—via sample complexity analysis—that CoT reduces the required sample size from exponential to polynomial by inducing sparse temporal dependencies among input tokens, thereby driving self-attention to become intrinsically sparse and interpretable. Crucially, CoT’s benefit stems not from enhanced representational capacity, but from structured dependency modeling. Empirically, CoT-trained models exhibit markedly sparser and more interpretable attention patterns, yielding 35–62% average gains in cross-task generalization performance.

Technology Category

Application Category

📝 Abstract

Chain-of-thought (CoT) significantly enhances the reasoning performance of large language models (LLM). While current theoretical studies often attribute this improvement to increased expressiveness and computational capacity, we argue that expressiveness is not the primary limitation in the LLM regime, as current large models will fail on simple tasks. Using a parity-learning setup, we demonstrate that CoT can substantially improve sample efficiency even when the representation power is sufficient. Specifically, with CoT, a transformer can learn the function within polynomial samples, whereas without CoT, the required sample size is exponential. Additionally, we show that CoT simplifies the learning process by introducing sparse sequential dependencies among input tokens, and leads to a sparse and interpretable attention. We validate our theoretical analysis with both synthetic and real-world experiments, confirming that sparsity in attention layers is a key factor of the improvement induced by CoT.

Problem

Research questions and friction points this paper is trying to address.

Chain-of-thought improves transformer sample efficiency.

CoT enables polynomial sample learning versus exponential.

Sparse attention enhances interpretability and learning efficiency.

Innovation

Methods, ideas, or system contributions that make the work stand out.

CoT enhances transformer sample efficiency

CoT introduces sparse sequential dependencies

Sparsity in attention layers improves performance

🔎 Similar Papers

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention