🤖 AI Summary
Large language models (LLMs) often develop degenerate attention matrices in their deeper layers: attention collapses toward a single column (effectively rank one), leaving these "lazy layers" structurally redundant and computationally wasteful. This work presents the first systematic investigation of the phenomenon and proposes a lightweight training recipe based on layer inheritance and progressive depth expansion: early-layer parameters are reused from a larger pre-trained model, while deeper layers are retrained and incrementally added. Rather than relying on conventional parameter scaling, the method targets depth directly. On GPT-2 medium, a 16-layer variant matches or exceeds the standard 24-layer model, reducing total parameters by roughly a third and lowering inference cost. The approach is architecture-agnostic, compatible with mainstream Transformer variants, and validated on diverse datasets including OpenWebText-9B and FineWeb_edu. Code is publicly released.
📝 Abstract
Large Language Models (LLMs) have achieved remarkable performance across various natural language processing tasks, primarily due to the transformer architecture and its self-attention mechanism. However, we observe that in standard decoder-style LLMs, attention matrices in deeper layers degenerate to single-column (effectively rank-one) structures. Layers in this state are unable to learn anything meaningful and are largely redundant; we refer to them as lazy layers. The goal of this paper is to train smaller models by eliminating this structural inefficiency without compromising performance. Motivated by this observation, we propose Inheritune, a simple yet effective training recipe for developing smaller, high-performing language models. Smaller models trained with Inheritune inherit early transformer layers from a larger pre-trained model, then are retrained and progressively expanded until they match or exceed the performance of the larger model. We demonstrate that Inheritune enables the training of various sizes of GPT-2 models on datasets like OpenWebText-9B and FineWeb_edu. Models trained with Inheritune, despite having significantly fewer layers, match or even surpass the performance of their larger counterparts. For instance, our 16-layer GPT-2 medium variant achieves performance comparable to the standard 24-layer GPT-2 medium model. Code is available at https://github.com/sanyalsunny111/LLM-Inheritune.
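The recipe described above (inherit early layers, retrain, then progressively expand) can be sketched abstractly as follows. This is a minimal illustration, not the authors' implementation: the helper names (`train_fn`, `new_layer_fn`) and the one-layer-at-a-time expansion schedule are assumptions for clarity; consult the linked repository for the actual training details.

```python
def inheritune(parent_layers, k_init, target_depth, train_fn, new_layer_fn):
    """Sketch of the Inheritune recipe (hypothetical interface).

    parent_layers: layer parameters of the larger pre-trained model.
    k_init: number of early layers the smaller model inherits.
    target_depth: depth at which to stop expanding.
    train_fn: callback that (re)trains the child layers and returns them.
    new_layer_fn: callback returning a freshly initialized layer.
    """
    # Step 1: inherit the early transformer layers of the parent model.
    child = list(parent_layers[:k_init])
    # Step 2: retrain the inherited stack.
    child = train_fn(child)
    # Step 3: progressively expand depth, retraining after each growth step
    # (the expansion granularity here is an assumption).
    while len(child) < target_depth:
        child.append(new_layer_fn())
        child = train_fn(child)
    return child
```

For example, with a 24-layer parent, inheriting 8 layers and growing to 16 would yield a model two-thirds the parent's depth, matching the GPT-2 medium setting reported in the abstract.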