🤖 AI Summary
How transformer layers in large language models (LLMs) specialize across knowledge retrieval, factual memory, and logical reasoning, and how that specialization depends on context, remains poorly understood.
Method: We combine systematic layer-wise ablation with likelihood- and generation-based evaluation, comparing across tasks (QA, reasoning, coherent text generation) and architectures (decoder-only vs. encoder-decoder), and use knowledge distillation to trace how function evolves across depth.
Contribution/Results: We find that shallow layers primarily handle knowledge retrieval and surface-level representation, while middle-to-deep layers are critical for long-range reasoning and generative coherence. Crucially, depth utilization is paradigm-dependent: non-generative tasks tolerate substantial deep-layer pruning without performance loss, whereas generative reasoning strictly requires middle-to-deep layers. Moreover, reasoning capabilities can be selectively distilled and transferred. These findings reveal the functional heterogeneity and context sensitivity of LLM depth, providing theoretical foundations for model compression, interpretability, and capability-aware architecture design.
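The layer-wise ablation protocol above can be illustrated with a deliberately tiny toy: treat the model as a stack of residual layer functions, zero out one layer's contribution (so the residual stream passes through that layer unchanged), and score the layer by how far the output moves. This is a hypothetical stdlib sketch, not the paper's code; actual studies apply the same idea to the transformer blocks of a pretrained LLM and measure task metrics rather than raw output shift.

```python
def run(layers, x):
    """Apply a residual stack: each layer adds its contribution to the stream."""
    for layer in layers:
        x = x + layer(x)
    return x

def ablate(layers, i):
    """Copy of the stack with layer i's contribution replaced by zero (identity in the residual stream)."""
    return [l if j != i else (lambda x: 0.0) for j, l in enumerate(layers)]

# Toy 4-layer "model": each layer scales its input by a fixed weight.
layers = [lambda x, w=w: w * x for w in (0.5, 0.25, 0.125, 0.0625)]

x0 = 3.0
baseline = run(layers, x0)
# Importance of layer i = how much the output changes when layer i is ablated.
importance = [abs(baseline - run(ablate(layers, i), x0)) for i in range(len(layers))]
```

In this toy the importance scores simply reflect each layer's weight; in a real LLM the analogous scores are what reveal which depths a given task and metric actually rely on.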
📝 Abstract
Recent studies suggest that the deeper layers of Large Language Models (LLMs) contribute little to representation learning and can often be removed without significant performance loss. However, such claims are typically drawn from narrow evaluations and may overlook important aspects of model behavior. In this work, we present a systematic study of depth utilization across diverse dimensions, including evaluation protocols, task categories, and model architectures. Our analysis confirms that very deep layers are generally less effective than earlier ones, but their contributions vary substantially with the evaluation setting. Under likelihood-based metrics without generation, pruning most layers preserves performance, with only the initial few being critical. By contrast, generation-based evaluation uncovers indispensable roles for middle and deeper layers in enabling reasoning and maintaining long-range coherence. We further find that knowledge and retrieval are concentrated in shallow components, whereas reasoning accuracy relies heavily on deeper layers -- yet can be reshaped through distillation. These results highlight that depth usage in LLMs is highly heterogeneous and context-dependent, underscoring the need for task-, metric-, and model-aware perspectives in both interpreting and compressing large models.
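The distillation by which reasoning accuracy "can be reshaped", per the abstract, is typically trained with a temperature-scaled KL divergence between teacher and student output distributions. The sketch below shows that standard objective in stdlib Python; the exact loss, temperature, and scaling are assumptions for illustration, not details taken from the paper.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    zs = [z / T for z in logits]
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) at temperature T, scaled by T^2 as in standard distillation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T ** 2) * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student matches the teacher exactly and grows as their distributions diverge, which is what lets a student selectively absorb the teacher's deep-layer reasoning behavior.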