🤖 AI Summary
Traditional Transformers incur high computational cost because quadratic-complexity self-attention is applied to all tokens at every layer. This paper proposes DTRNet, which decouples token updating from attention mixing and introduces a dynamic token routing mechanism: only ~10% of tokens undergo full self-attention, while the rest are updated via lightweight linear modules, enabling adaptive allocation of computation. Unlike layer skipping or coarse-grained expert routing, DTRNet operates at the fine-grained token level without sacrificing modeling capacity, significantly reducing FLOPs and memory footprint while preserving performance. Under identical compute budgets, DTRNet outperforms MoD, D-LLM, and other baselines across multiple benchmarks; on long-sequence tasks it achieves substantial FLOPs reductions with minimal accuracy degradation. The core contribution is the first integration of fine-grained dynamic token routing with a hybrid linear-sparse attention architecture, balancing computational efficiency and representational power.
📝 Abstract
Transformers achieve state-of-the-art results across many tasks, but their uniform application of quadratic self-attention to every token at every layer makes them computationally expensive. We introduce DTRNet (Dynamic Token Routing Network), an improved Transformer architecture that allows tokens to dynamically skip the quadratic cost of cross-token mixing while still receiving lightweight linear updates. By preserving the MLP module and reducing the attention cost for most tokens to linear, DTRNet ensures that every token is explicitly updated while significantly lowering overall computation. This design offers an efficient and effective alternative to standard dense attention. Once trained, DTRNet routes only ~10% of tokens through attention at each layer while maintaining performance comparable to a full Transformer. It consistently outperforms routing-based layer-skipping methods such as MoD and D-LLM in both accuracy and memory at matched FLOPs, while routing fewer tokens to full attention. Its efficiency gains scale with sequence length, offering significant FLOP reductions for long-context inputs. By decoupling token updates from attention mixing, DTRNet substantially reduces the quadratic share of computation, providing a simple, efficient, and scalable alternative to standard Transformers.
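To make the routing idea concrete, here is a minimal NumPy sketch of one DTRNet-style block under simplifying assumptions: a learned scalar router scores each token, the top ~10% by score go through full self-attention, and every remaining token receives only a cheap linear update. All names here (`dtr_block`, `w_router`, `w_lin`, the single-head attention) are illustrative, not the paper's actual implementation, which the abstract does not specify at this level of detail.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dtr_block(x, w_router, w_qkv, w_lin, capacity=0.1):
    """One hypothetical DTRNet-style block: route the top-k tokens to full
    self-attention; update the rest with a lightweight linear map."""
    n, d = x.shape
    k = max(1, int(capacity * n))        # ~10% of tokens get full attention
    scores = x @ w_router                 # (n,) router logits, one per token
    top = np.argsort(scores)[-k:]         # indices of the routed tokens

    out = x @ w_lin                       # cheap linear update for every token
    # Quadratic self-attention only among the k routed tokens (k x k, not n x n).
    sub = x[top]
    q, key, v = (sub @ w for w in w_qkv)
    attn = softmax(q @ key.T / np.sqrt(d)) @ v
    out[top] = attn                       # routed tokens take the attention output
    return out

n, d = 32, 16
x = rng.standard_normal((n, d))
w_router = rng.standard_normal(d)
w_qkv = [rng.standard_normal((d, d)) for _ in range(3)]
w_lin = rng.standard_normal((d, d))
y = dtr_block(x, w_router, w_qkv, w_lin)
print(y.shape)  # (32, 16): every token is updated, but only 3 paid quadratic cost
```

Note the cost structure this sketch illustrates: the attention matrix is k×k rather than n×n, so as sequence length grows with a fixed routing fraction, the quadratic term shrinks relative to the linear one, matching the abstract's claim that the efficiency gains scale with sequence length.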