Dr.LLM: Dynamic Layer Routing in LLMs

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inefficiency of fixed-depth inference in large language models (LLMs), where simple queries incur redundant computation while complex queries suffer from insufficient depth, this paper proposes Dr.LLM, a dynamic layer routing framework. During the forward pass, a lightweight per-layer router decides whether to skip, execute, or repeat each transformer block, allocating compute on demand. To train the routers, the method uses Monte Carlo Tree Search (MCTS) to derive high-quality layer configurations as explicit supervision signals; the pretrained weights are left unchanged, so the framework is retrofittable and plug-and-play. The design further incorporates window-based pooling for stable long-sequence routing, a class-balanced focal loss, and bottleneck MLP routers. Experiments show accuracy gains of up to 3.4 percentage points on ARC and DART while saving an average of five transformer layers per example. On out-of-domain tasks, accuracy drops by only 0.85% while efficiency is retained, and Dr.LLM outperforms prior routing methods by up to 7.7 percentage points in accuracy.
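The skip/execute/repeat mechanism described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the router sizes, pooling window, greedy action selection, and all function names are assumptions, and the transformer blocks are stand-ins.

```python
import numpy as np

ACTIONS = ["skip", "execute", "repeat"]

class BottleneckRouter:
    """Tiny bottleneck MLP mapping a pooled hidden state to a routing action.
    Dimensions and init scale are illustrative assumptions."""
    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.02, (d_model, d_bottleneck))  # down-projection
        self.w2 = rng.normal(0, 0.02, (d_bottleneck, 3))        # 3 routing actions

    def __call__(self, pooled):
        h = np.maximum(pooled @ self.w1, 0.0)   # ReLU bottleneck
        logits = h @ self.w2
        return ACTIONS[int(np.argmax(logits))]  # greedy action choice

def windowed_pool(hidden, window=4):
    """Mean-pool the last `window` token states for a stable routing signal."""
    return hidden[-window:].mean(axis=0)

def route_forward(hidden, layers, routers):
    """One forward pass where each router may skip, execute, or repeat its block."""
    executed = 0
    for layer, router in zip(layers, routers):
        action = router(windowed_pool(hidden))
        if action == "skip":
            continue                     # block not run at all
        hidden = layer(hidden)           # execute once
        executed += 1
        if action == "repeat":
            hidden = layer(hidden)       # run the same frozen block again
            executed += 1
    return hidden, executed
```

Because the base blocks are frozen and the routers sit outside them, a sketch like this can in principle be retrofitted onto a pretrained stack: only `w1`/`w2` per layer would be trained, against the MCTS-derived action labels.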

📝 Abstract
Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design (windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers) ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.
Problem

Research questions and friction points this paper is trying to address.

Fixed-depth inference wastes computation on simple queries
Hard queries need deeper reasoning than a single fixed-depth pass provides
Prior adaptive-depth methods require costly inference-time search, architectural changes, or retraining, and often trade accuracy for efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic layer routing with lightweight per-layer routers
Monte Carlo Tree Search for high-quality layer configurations
Windowed pooling, class-balanced focal loss, and bottleneck MLP routers for robust routing
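The class-balanced focal loss named above can be sketched in one common formulation: focal down-weighting of easy examples combined with inverse-frequency class weights. This is a generic illustration under assumed conventions; the paper's exact weighting scheme and hyperparameters may differ, and all names are hypothetical.

```python
import numpy as np

def class_balanced_focal_loss(logits, labels, class_counts, gamma=2.0):
    """Focal loss with per-class weights inversely proportional to class
    frequency. `logits` is (batch, num_classes); `labels` is (batch,);
    `class_counts` holds training-set counts per class (e.g. per routing
    action). gamma=2.0 is the standard focal-loss default, an assumption here.
    """
    # numerically stable softmax
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

    # inverse-frequency class weights, normalized so their mean is 1
    weights = class_counts.sum() / (len(class_counts) * class_counts)

    p_t = probs[np.arange(len(labels)), labels]   # prob of the true class
    alpha_t = weights[labels]                     # weight of the true class
    # (1 - p_t)^gamma suppresses well-classified (easy) examples
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)
    return loss.mean()
```

The motivation in a routing setting is that one action (typically "execute") dominates the supervision labels, so a plain cross-entropy router would collapse toward always executing; the rarer skip/repeat classes get up-weighted and hard examples dominate the gradient.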