🤖 AI Summary
This work addresses the inefficiency of conventional Transformers, which employ static and uniform depth expansion during training, leading to substantial computational redundancy. The study uncovers, for the first time, a consistent trajectory of attention head maturation from deeper to shallower layers throughout training. Leveraging this insight, the authors propose a dynamic sparse depth allocation mechanism that selects high-information heads based on attention entropy and progressively enhances them through a top-down attention recurrence scheme, enabling structured sparsity to grow organically during training. Evaluated across multiple model scales, the method significantly outperforms static recurrence baselines while reducing training FLOPs overhead from 16–20% to merely 1–3%, thereby achieving markedly improved computational efficiency and parameter utilization.
📝 Abstract
Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static throughout training, and additional computational depth is assigned uniformly to entire blocks at the parameter level. This rigidity across both training time and parameter space leads to substantial computational redundancy. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16--20% to only 1--3% relative to a standard Transformer backbone.
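The entropy-based head-selection criterion described above can be sketched as follows. This is a minimal NumPy illustration, assuming "high-information" heads are ranked by the Shannon entropy of each head's attention distribution averaged over query positions; the function names and the simple top-k rule are illustrative, not the authors' implementation.

```python
import numpy as np

def head_entropy(attn):
    """Mean Shannon entropy of each head's attention distribution.

    attn: array of shape (heads, queries, keys); each row over keys sums to 1.
    Returns an array of shape (heads,) with the per-head mean entropy.
    """
    eps = 1e-12  # avoid log(0) on exactly-zero attention weights
    per_query = -(attn * np.log(attn + eps)).sum(axis=-1)  # (heads, queries)
    return per_query.mean(axis=-1)

def select_informative_heads(attn, k):
    """Indices of the k highest-entropy heads (candidates for recurrence)."""
    scores = head_entropy(attn)
    return np.argsort(scores)[::-1][:k]
```

For example, a head whose rows are one-hot (fully peaked attention) scores near-zero entropy, while a head attending uniformly over n keys scores log n, so the uniform head would be selected first.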