🤖 AI Summary
This study investigates whether the inefficient depth usage that large language models show in single-turn static tasks also holds in multi-turn autonomous agent settings. Using residual stream probing, causal layer-skipping interventions, and effective-depth measurements across three domains (Deep Research, code generation, and tabular processing), the work identifies an adaptive depth-utilization pattern in agentic reasoning: semantic direction forms relatively early, while deeper layers remain necessary to stabilize final outputs. Models progressively recruit more and deeper layers as trajectories unfold, with stronger long-range inter-layer dependencies and increasingly correction-dominant residual updates in later turns. The construction-refinement gap is pronounced in Qwen and Minimax, whereas GLM shows a more domain-dependent depth allocation pattern.
📝 Abstract
Recent mechanistic studies suggest that large language models (LLMs) may utilize their depth inefficiently in standard single-turn tasks. Whether this still holds in autonomous agent settings, where models must perform multi-turn planning, tool use, and iterative state updates, remains unclear. We study this question through a systematic layer-wise analysis of complete user-agent trajectories spanning three domains: Deep Research, Code Generation, and Tabular Processing. Using residual stream probes, causal layer-skipping interventions, and effective-depth measurements, we show that agentic reasoning exhibits a distinct depth profile from static tasks. As trajectories unfold, models progressively recruit more and deeper layers, with stronger long-range inter-layer dependencies emerging in later turns. At the same time, residual updates become increasingly correction-dominant, indicating a shift from stable feature accumulation toward repeated recalibration. Effective-depth analysis further reveals a substantial construction-refinement gap: semantic direction often forms relatively early, while deep layers remain necessary for stabilizing final outputs. Across model families, this gap is pronounced in Qwen and Minimax, whereas GLM shows a more domain-dependent depth allocation pattern. These results provide mechanistic evidence that autonomous LLM agents allocate depth adaptively as reasoning complexity grows.
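To make the terms concrete, the following is a minimal sketch of a layer-skipping intervention and an update-alignment measurement on a toy residual stack. It is not the authors' setup: the random weight matrices, the `tanh` update rule, and the names `make_layers`, `forward`, `skip_effects`, and `update_alignment` are all assumptions chosen for illustration, and the toy has no attention, tokens, or agent trajectory.

```python
import numpy as np


def make_layers(n_layers, d_model, seed=0):
    """Random weight matrices standing in for transformer blocks (toy assumption)."""
    rng = np.random.default_rng(seed)
    return [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]


def forward(x, layers, skip=None):
    """Run a toy residual stack: h <- h + tanh(h @ W_i), optionally skipping one layer."""
    h = x.copy()
    for i, w in enumerate(layers):
        if i == skip:
            continue  # the skipped layer adds no update to the residual stream
        h = h + np.tanh(h @ w)
    return h


def skip_effects(x, layers):
    """Relative change in the final state when each layer is skipped in turn."""
    base = forward(x, layers)
    base_norm = np.linalg.norm(base)
    return np.array([
        np.linalg.norm(base - forward(x, layers, skip=i)) / base_norm
        for i in range(len(layers))
    ])


def update_alignment(x, layers):
    """Cosine between each layer's update and the stream it reads from.

    Negative values mean the update opposes the current state, a rough
    stand-in for a correction-like update.
    """
    h = x.copy()
    cosines = []
    for w in layers:
        u = np.tanh(h @ w)
        cosines.append(float(u @ h / (np.linalg.norm(u) * np.linalg.norm(h))))
        h = h + u
    return np.array(cosines)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=16)
    layers = make_layers(n_layers=6, d_model=16)
    print("skip effects:", np.round(skip_effects(x, layers), 3))
    print("update/stream cosine:", np.round(update_alignment(x, layers), 3))
```

In a real model the same idea would be applied to hidden states captured per layer and per turn, with the skip effect measured on output logits or task metrics rather than on the raw final state; those details are not specified in the abstract.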