How Do LLMs Use Their Depth?

📅 2025-10-21
🤖 AI Summary
Do large language models (LLMs) use their depth uniformly across layers? Existing studies lack a fine-grained characterization of layer-wise prediction dynamics. Method: We propose a "Guess-then-Refine" computational framework, integrating intermediate-representation tracing on open-weight models, lexical-frequency and part-of-speech (POS) analysis, multi-step reasoning decomposition, and response-format identification across diverse case studies. Contribution/Results: We present the first empirical evidence of a hierarchical prediction mechanism in LLMs: early layers generate coarse initial guesses grounded primarily in high-frequency tokens, while deeper layers specialize in context-sensitive refinement. Over 70% of initial predictions are corrected in later layers. Furthermore, we identify a stage-specific, depth-wise division of labor that transfers across POS tagging, factual recall, and multiple-choice tasks, revealing consistent functional specialization across architectural depths and task types.

📝 Abstract
Growing evidence suggests that large language models do not use their depth uniformly, yet we still lack a fine-grained understanding of their layer-wise prediction dynamics. In this paper, we trace the intermediate representations of several open-weight models during inference and reveal a structured and nuanced use of depth. Specifically, we propose a "Guess-then-Refine" framework that explains how LLMs internally structure their computations to make predictions. We first show that the top-ranked predictions in early LLM layers are composed primarily of high-frequency tokens, which act as statistical guesses proposed by the model early on due to the lack of appropriate contextual information. As contextual information develops deeper into the model, these initial guesses get refined into contextually appropriate tokens. Even high-frequency token predictions from early layers get refined >70% of the time, indicating that correct token prediction is not "one-and-done". We then go beyond frequency-based prediction to examine the dynamic usage of layer depth across three case studies. (i) Part-of-speech analysis shows that function words are, on average, the earliest to be predicted correctly. (ii) Fact recall task analysis shows that, in a multi-token answer, the first token requires more computational depth than the rest. (iii) Multiple-choice task analysis shows that the model identifies the format of the response within the first half of the layers, but finalizes its response only toward the end. Together, our results provide a detailed view of depth usage in LLMs, shedding light on the layer-by-layer computations that underlie successful predictions and providing insights for future works to improve computational efficiency in transformer-based models.
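The tracing method the abstract describes resembles a logit-lens-style readout: decode each layer's residual-stream state through the unembedding matrix and watch the top-ranked token change with depth. A minimal synthetic sketch of that idea (toy vocabulary, random unembedding matrix, simulated hidden states; this is an illustration, not the paper's actual code or data):

```python
# Logit-lens-style depth tracing on synthetic data: each layer's hidden
# state is projected through a shared unembedding matrix, and we record
# which vocabulary item is top-ranked at every depth.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["the", "a", "of", "Paris", "London"]  # toy vocabulary
D, L = 64, 6                                   # hidden size, number of layers

W_U = rng.normal(size=(D, len(VOCAB)))         # shared unembedding matrix

# Simulated residual stream: start aligned with a high-frequency token
# ("the", the early statistical guess) and drift linearly toward the
# contextually appropriate answer ("Paris") as depth increases.
guess_dir = W_U[:, VOCAB.index("the")]
answer_dir = W_U[:, VOCAB.index("Paris")]
hidden_states = [
    (1 - t) * guess_dir + t * answer_dir
    for t in np.linspace(0.0, 1.0, L)
]

def top1(h):
    """Decode a hidden state into its top-ranked vocabulary item."""
    return VOCAB[int(np.argmax(h @ W_U))]

trajectory = [top1(h) for h in hidden_states]
print(trajectory)  # layer-by-layer top-1 token, from early guess to refined answer
```

On real models the same readout would use each transformer block's hidden state and the model's own unembedding matrix; the "guess-then-refine" pattern corresponds to the top-1 token switching from a high-frequency filler early on to the context-appropriate token in later layers.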
Problem

Research questions and friction points this paper is trying to address.

Analyzing layer-wise prediction dynamics in large language models
Investigating how LLMs refine early guesses using contextual depth
Examining computational depth usage across linguistic tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed Guess-then-Refine framework for layer dynamics
Traced intermediate representations during model inference
Analyzed depth usage across three linguistic case studies
Akshat Gupta
University of California, Berkeley
Knowledge Editing · Natural Language Processing · Spoken Language Modeling
Jay Yeung
University of California, Berkeley
G. Anumanchipalli
University of California, Berkeley
Anna Ivanova
Georgia Institute of Technology