Attention Is All You Need For Mixture-of-Depths Routing

📅 2024-12-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional Mixture-of-Depths (MoD) models suffer from training instability, high computational complexity, and deployment overhead due to their dedicated routing layers. To address these issues, this paper proposes a parameter-free, attention-driven dynamic depth-routing mechanism. Instead of introducing auxiliary routing networks, the method reuses the self-attention maps from the preceding layer as token-wise routing signals for the current layer, enabling zero parameter overhead and plug-and-play integration with pretrained Vision Transformers (ViTs). The core idea is to leverage the existing self-attention mechanism for dynamic, input-adaptive computation allocation, balancing inference efficiency against model capacity. On ImageNet, the approach achieves up to 2% higher top-1 accuracy than standard MoD baselines and outperforms ViT counterparts at comparable FLOPs; in transfer learning, it accelerates convergence by up to 2×.
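The routing idea in the summary can be sketched in a few lines: reduce the previous layer's attention map to a per-token importance score, then route only the top-scoring fraction of tokens through the current layer while the rest take the residual path. This is a minimal illustrative sketch, not the paper's implementation; the averaging reduction, the fixed capacity fraction, and the function names are assumptions for illustration.

```python
import numpy as np

def attention_routing_scores(attn_map):
    """Per-token importance from the previous layer's attention map.

    attn_map: array of shape (num_heads, num_tokens, num_tokens).
    Here a token's score is the average attention it receives,
    averaged over heads and query positions -- one plausible
    reduction; the paper may use a different one.
    """
    return attn_map.mean(axis=(0, 1))  # shape: (num_tokens,)

def route_tokens(x, attn_map, capacity=0.5):
    """Select which tokens the current layer should process.

    x: (num_tokens, dim) token embeddings.
    capacity: fraction of tokens routed through the layer
              (a hypothetical fixed budget for this sketch).
    Returns sorted indices of routed tokens; all other tokens
    would skip the layer via the residual connection.
    """
    scores = attention_routing_scores(attn_map)
    k = max(1, int(capacity * x.shape[0]))
    routed = np.argsort(scores)[::-1][:k]  # k highest-scoring tokens
    return np.sort(routed)
```

Because the scores come from an attention map the model already computes, this routing step adds no trainable parameters, which is what allows it to be bolted onto a pretrained ViT.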

📝 Abstract
Advancements in deep learning are driven by training models with increasingly larger numbers of parameters, which in turn heightens the computational demands. To address this issue, Mixture-of-Depths (MoD) models have been proposed to dynamically assign computation only to the most relevant parts of the input, enabling large-parameter models to run efficiently during inference and training. These MoD models use a routing mechanism to determine which tokens a layer should process and which should skip it. However, conventional MoD models employ additional network layers dedicated to routing, which are difficult to train and add complexity and deployment overhead to the model. In this paper, we introduce A-MoD, a novel attention-based routing mechanism that leverages the existing attention map of the preceding layer for routing decisions in the current layer. Compared to standard routing, A-MoD enables more efficient training, as it introduces no additional trainable parameters and can easily be adapted from pretrained transformer models. It can also increase the performance of the MoD model: for instance, we observe up to 2% higher accuracy on ImageNet compared to standard routing and isoFLOP ViT baselines. Furthermore, A-MoD improves MoD training convergence, yielding up to 2× faster transfer learning.
Problem

Research questions and friction points this paper is trying to address.

Deep Learning
Model Optimization
Resource Allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

A-MoD method
parameter-efficient
transfer learning acceleration
🔎 Similar Papers
No similar papers found.