🤖 AI Summary
This work addresses the “curse of depth” in large language models, where deeper layers contribute minimally, leading to underutilized model capacity. The study systematically investigates how sparsity—both implicit (e.g., weight decay–induced weight sparsity and sparse activation in mixture-of-experts) and explicit (e.g., sparse attention over long contexts and grouped-query attention)—enhances depth utilization by modulating variance propagation. Through depth-scaling experiments and layer-wise intervention analyses, the authors demonstrate that sparsity promotes functional differentiation across layers and reduces output variance. Building on these insights, they propose practical training guidelines that yield a 4.6% accuracy improvement on downstream tasks, confirming sparsity as an effective mechanism for mitigating the curse of depth.
📝 Abstract
Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long-context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer-effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at https://github.com/pUmpKin-Co/SparsityAndCoD.
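The mechanism named in the abstract can be illustrated with a toy simulation (not the authors' code): in a Pre-LN residual stream, each sublayer sees a normalized input and adds a roughly unit-scale update, so the residual variance accumulates with depth; zeroing a fraction of each update (a crude stand-in for sparse activations or weights) slows that accumulation. All names and the `sparsity` masking scheme here are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 512, 32  # hidden size and number of residual blocks (toy values)

def run(depth: int, sparsity: float = 0.0) -> list[float]:
    """Simulate a Pre-LN residual stream x <- x + f(LN(x)).

    Because LN(x) is normalized, the sublayer update f(LN(x)) is modeled
    as a unit-variance vector regardless of the current scale of x.
    `sparsity` zeroes that fraction of each update's coordinates,
    a toy stand-in for sparse activation or sparse weights.
    Returns the per-layer variance of the residual stream.
    """
    x = rng.normal(size=d)
    variances = []
    for _ in range(depth):
        update = rng.normal(size=d)          # f(LN(x)): ~unit variance
        if sparsity > 0.0:
            update *= rng.random(d) >= sparsity  # keep ~(1 - sparsity) of coords
        x = x + update                        # residual connection
        variances.append(float(x.var()))
    return variances

dense = run(depth)                 # variance grows roughly linearly with depth
sparse = run(depth, sparsity=0.9)  # sparse updates accumulate variance far slower
```

In the dense run, the final-layer variance is roughly `1 + depth`, so a deep block's relative contribution (unit-variance update on top of a large-variance stream) shrinks toward identity behavior; the sparse run keeps the stream's variance small, which is the intuition the paper develops for why sparsity improves depth utilization.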