Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and deployment challenges of Vision Transformers (ViTs), this paper proposes a post-training, retraining-free Mixture-of-Experts (MoE) extraction method. The approach operates in two stages: first, data-driven clustering of MLP-layer activation patterns identifies latent expert structures; second, the sparse subnetworks responsible for those patterns are traced back and extracted. A lightweight sparse router is then constructed and adapted with minimal fine-tuning. This is the first purely data-driven decomposition of a pretrained ViT into a high-performance MoE architecture that requires no retraining from scratch, overcoming the static structural constraints of pre-trained models. Evaluated on ImageNet-1k, the method recovers 98% of the original ViT's accuracy while reducing MACs by 36% and parameters by 32%. Moreover, it enables plug-and-play expert integration and dynamic expert scaling, facilitating efficient, modular inference without architectural modification.
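The first stage above (clustering activation patterns to find latent experts) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the binarization threshold, clustering algorithm (k-means here), and the 0.5 firing-rate cutoff are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_activation_patterns(activations: np.ndarray, n_experts: int, seed: int = 0):
    """Cluster MLP activation patterns into candidate experts.

    activations: (n_samples, hidden_dim) post-activation outputs of one
    encoder-block MLP layer, collected over a calibration set.
    Returns per-sample cluster labels and, per cluster, the indices of
    hidden units that fire for most samples (the candidate expert units).
    """
    # Binarize: a unit "fires" for a sample if its activation is positive.
    fired = (activations > 0).astype(np.float32)
    km = KMeans(n_clusters=n_experts, n_init=10, random_state=seed)
    labels = km.fit_predict(fired)
    experts = []
    for c in range(n_experts):
        freq = fired[labels == c].mean(axis=0)   # firing rate per hidden unit
        experts.append(np.where(freq > 0.5)[0])  # units active in most samples
    return labels, experts
```

In practice the calibration activations would come from forward hooks on the pretrained ViT; here any `(n_samples, hidden_dim)` array stands in for them.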

📝 Abstract
Vision Transformers have emerged as the state-of-the-art models in various Computer Vision tasks, but their high computational and resource demands pose significant challenges. While Mixture-of-Experts (MoE) can make these models more efficient, they often require costly retraining or even training from scratch. Recent developments aim to reduce these computational costs by leveraging pretrained networks. These have been shown to produce sparse activation patterns in the Multi-Layer Perceptrons (MLPs) of the encoder blocks, allowing for conditional activation of only relevant subnetworks for each sample. Building on this idea, we propose a new method to construct MoE variants from pretrained models. Our approach extracts expert subnetworks from the model's MLP layers post-training in two phases. First, we cluster output activations to identify distinct activation patterns. In the second phase, we use these clusters to extract the corresponding subnetworks responsible for producing them. On ImageNet-1k recognition tasks, we demonstrate that these extracted experts can perform surprisingly well out of the box and require only minimal fine-tuning to regain 98% of the original performance, all while reducing MACs and model size by up to 36% and 32%, respectively.
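The second phase (extracting the subnetwork responsible for a cluster's activation pattern) amounts to slicing the MLP's weight matrices down to the selected hidden units. A minimal sketch, assuming a standard two-layer transformer MLP (`fc1 -> activation -> fc2`) and a given index set of expert units; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def extract_expert_mlp(W_in, b_in, W_out, unit_idx):
    """Slice a two-layer MLP down to one expert's hidden units.

    W_in:  (hidden, d_model) first linear layer weights
    b_in:  (hidden,)         first layer bias
    W_out: (d_model, hidden) second linear layer weights
    unit_idx: indices of hidden units assigned to this expert

    The returned expert computes exactly what the full MLP would compute
    if all non-selected hidden units were zeroed out, so fewer units
    directly translate into fewer MACs and parameters.
    """
    return W_in[unit_idx], b_in[unit_idx], W_out[:, unit_idx]
```

A router would then dispatch each token (or sample) to one such sliced expert instead of running the full MLP.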
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs of Vision Transformers
Extracting Mixture-of-Experts from pretrained models
Maintaining performance while decreasing model size
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extract expert subnetworks from pretrained MLP layers
Cluster activations to identify distinct patterns
Minimal fine-tuning regains 98% original performance