Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep models suffer from the "Curse of Depth": in standard pre-layer-norm Transformers, the contributions of later layers to the output distribution decay sharply, leaving model depth underutilized. This work establishes a mechanistic connection between depth-growth training and the Curse of Depth. Analyzing MIDAS (Saunshi et al., 2024), a lightweight method that grows model depth via gradual middle stacking, the authors show that such growth reshapes the residual stream, yields more balanced layer-wise contributions, and facilitates the formation of permutable computational blocks. Using depth-wise analyses and residual-stream visualizations, they empirically validate that middle stacking mitigates contribution decay. They further propose a lightweight modification of MIDAS that improves performance on downstream reasoning benchmarks without compromising training efficiency.
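The contribution decay described above can be illustrated with a minimal sketch: in a pre-layer-norm residual stream, each block's output is added to a stream whose norm grows with depth, so the relative size of later updates shrinks. The toy update rule, scales, and the norm-ratio metric below are illustrative assumptions, not the paper's exact measurement.

```python
import numpy as np

# Hedged sketch: one way to quantify per-layer contribution in a
# pre-LayerNorm residual stream, in the spirit of "Curse of Depth"
# analyses. Shapes and scales are illustrative assumptions.

rng = np.random.default_rng(0)
d_model, n_layers = 64, 12

def layer_contributions(n_layers, d_model, rng):
    """Return ||update_l|| / ||stream after layer l|| for each layer."""
    stream = rng.normal(size=d_model)  # residual stream after embedding
    ratios = []
    for _ in range(n_layers):
        update = rng.normal(scale=0.5, size=d_model)  # stand-in for a block output
        stream = stream + update                      # pre-LN residual addition
        ratios.append(np.linalg.norm(update) / np.linalg.norm(stream))
    return ratios

ratios = layer_contributions(n_layers, d_model, rng)
# The stream norm grows with depth, so later layers' relative
# contributions shrink even though the updates stay the same size.
print([round(r, 3) for r in ratios])
```

Because the updates here are random with constant scale, the decay is purely a geometric effect of the growing stream norm; the paper's point is that depth-grown models redistribute these contributions more evenly.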

📝 Abstract
Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half, also known as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
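The "gradual middle stacking" growth mechanism can be sketched abstractly: training starts with a shallow model, and at each growth stage the middle block of layers is duplicated in place while the outermost layers stay fixed. The schedule, block size, and copy placement below are illustrative assumptions in the spirit of MIDAS, not the exact published recipe; layers are plain labels here, where a real implementation would deep-copy parameter modules.

```python
import copy

# Hedged sketch of gradual "middle stacking" depth growth, in the
# spirit of MIDAS (Saunshi et al., 2024). Details are assumptions.

def middle_stack(layers, k):
    """Grow a model by duplicating its middle k layers in place."""
    n = len(layers)
    start = (n - k) // 2
    middle = copy.deepcopy(layers[start:start + k])
    # Insert the copies right after the originals, keeping the first
    # and last layers (embedding-adjacent computation) untouched.
    return layers[:start + k] + middle + layers[start + k:]

# Grow a 4-layer "model" to 8 layers over two stacking stages.
model = ["L0", "L1", "L2", "L3"]
model = middle_stack(model, 2)   # 6 layers
model = middle_stack(model, 2)   # 8 layers
print(model)
```

Keeping the first and last layers fixed across stages is one plausible reading of why growth encourages permutable middle blocks: the duplicated interior layers start from identical parameters and are free to specialize or swap roles.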
Problem

Research questions and friction points this paper is trying to address.

Analyzes whether depth-grown Transformers overcome the Curse of Depth
Investigates how gradual depth growth improves model depth utilization
Proposes a modified MIDAS to enhance downstream reasoning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradual depth growth reduces training costs
Middle stacking improves model depth utilization
A lightweight MIDAS modification improves downstream reasoning performance
Ferdinand Kapl
Technical University of Munich
Emmanouil Angelis
PhD student, Helmholtz AI/TUM
Machine Learning, Causality
Tobias Hoppe
Technical University of Munich
Kaitlin Maile
Google, Paradigms of Intelligence Team
Johannes von Oswald
Research Scientist, Google
Deep Learning
Nino Scherrer
Google
Causality, Robustness, Cognition, LLM Evaluations, Synthetic Data
Stefan Bauer
Technical University of Munich