DND: Boosting Large Language Models with Dynamic Nested Depth

📅 2025-10-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the inefficiency of off-the-shelf large language models (LLMs) in processing critical tokens and the high parameter and computational overhead of conventional enhancement methods, this paper proposes the Dynamic Nested Depth (DND) mechanism. DND identifies critical tokens at the end of each Transformer layer and applies lightweight nested feed-forward recomputation *only* to those tokens—bypassing redundant full-sequence computation. Innovatively, it introduces a differentiable routing control loss and a threshold stabilization mechanism to enable stable, adaptive critical-token selection, supporting both dense and Mixture-of-Experts (MoE) architectures. DND is inserted during post-training without requiring large-scale fine-tuning, operating directly on pretrained models. Evaluated on Qwen3-1.7B and Qwen3-30B-A3B, DND achieves accuracy gains of +1.88% and +0.87%, respectively, while incurring negligible additional parameters and computational cost.
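The per-layer mechanism described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `dnd_layer`, the sigmoid router, the shapes, and the score-weighted residual merge are all assumptions made for the sketch.

```python
import numpy as np

def dnd_layer(hidden, w_router, ffn, tau=0.5):
    """Sketch of Dynamic Nested Depth token re-processing (illustrative only).

    hidden:   (seq_len, d) token states at the end of a transformer layer
    w_router: (d,) router weights producing a per-token criticality score
    ffn:      callable mapping (k, d) -> (k, d), the nested feed-forward pass
    tau:      selection threshold on router probabilities
    """
    scores = 1.0 / (1.0 + np.exp(-hidden @ w_router))  # sigmoid router scores
    mask = scores > tau                                # critical-token selection
    out = hidden.copy()
    if mask.any():
        # Re-process only the selected tokens; weighting by the router score
        # keeps the selection differentiable in a real autograd implementation.
        out[mask] = out[mask] + scores[mask, None] * ffn(out[mask])
    return out, mask
```

The key property is that the nested feed-forward pass runs only on the selected rows, so the extra compute scales with the fraction of critical tokens rather than the full sequence length.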


📝 Abstract
We introduce Dynamic Nested Depth (DND), a novel method that improves performance for off-the-shelf LLMs by selecting critical tokens to reprocess in a nested depth manner. Specifically, at the end of a given transformer layer, DND identifies the more critical tokens with a router and feeds them back for an extra round of processing, effectively "reviewing" difficult tokens while avoiding redundant computation for easier ones. The dynamic selection mechanism is tailored for precise control via two novel strategies: a router-controlling loss to enhance token selection distinguishability, and a threshold control scheme to ensure selection stability. We demonstrate the effectiveness of DND by directly integrating it into pre-trained dense and MoE models during a post-training phase. On diverse benchmarks, this approach boosts the performance of the dense Qwen3-1.7B by 1.88% and the MoE Qwen3-30B-A3B by 0.87%, all with a minimal parameter and compute increase.
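The abstract's router-controlling loss is not specified in detail here; the sketch below shows one plausible form under stated assumptions. Both terms are hypothetical: a binary-entropy sharpness term pushing router probabilities away from 0.5 (making selections more distinguishable), plus a budget term keeping the expected selection fraction near a target (a stand-in for selection stability).

```python
import numpy as np

def router_control_loss(probs, target_ratio=0.25, eps=1e-8):
    """Hypothetical router-controlling loss (illustrative, not the paper's form).

    probs:        (seq_len,) router selection probabilities in (0, 1)
    target_ratio: desired fraction of tokens selected for re-processing
    """
    p = np.clip(probs, eps, 1.0 - eps)
    # Sharpness: mean binary entropy, minimized when probs are near 0 or 1.
    sharpness = -np.mean(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    # Budget: penalize deviation of the expected selection rate from the target.
    budget = (np.mean(p) - target_ratio) ** 2
    return sharpness + budget
```

Under this formulation, confident near-binary router outputs whose mean selection rate stays near the budget incur a lower loss than uniformly uncertain outputs.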
Problem

Research questions and friction points this paper is trying to address.

Improving LLM performance by reprocessing critical tokens dynamically
Enhancing token selection via router loss and threshold control
Boosting pre-trained models with minimal parameter and computation overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Nested Depth reprocesses only critical tokens
Router-controlling loss and threshold scheme stabilize token selection
Post-training integration boosts performance with minimal overhead