Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses expert symmetry and insufficient early specialization in sparse upcycling by proposing a cluster-aware initialization method. It partitions the pretrained dense model's input activations into semantic clusters, initializes each expert of the Mixture-of-Experts (MoE) model from its cluster's subspace representation obtained via truncated singular value decomposition (SVD), and sets the router's initial weights to the cluster centroids. An expert-ensemble self-distillation loss further stabilizes training by providing routing guidance from an ensemble teacher. Together, these components break symmetry among experts, leading to more confident routing and more disentangled expert representations. Evaluated on CLIP ViT-B/32 and ViT-B/16, the method consistently outperforms existing approaches on zero-shot and few-shot benchmarks.
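The initialization summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`kmeans`, `cluster_subspace`, `cluster_aware_init`), the plain Lloyd's k-means, the uncentered SVD, and in particular the step that projects the dense weights onto each cluster's subspace are all assumptions made for illustration, since the summary does not specify how the subspace is combined with the dense weights.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative helper). Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    # Final assignment against the final centroids.
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, d2.argmin(axis=1)

def cluster_subspace(x_c, rank):
    """Top-`rank` right singular vectors of one cluster's activations: (d, rank)."""
    _, _, vt = np.linalg.svd(x_c, full_matrices=False)
    return vt[:rank].T

def cluster_aware_init(acts, w_dense, num_experts, rank):
    """acts: (n, d) input activations; w_dense: (h, d) dense layer weights."""
    centroids, labels = kmeans(acts, num_experts)
    experts = []
    for c in range(num_experts):
        basis = cluster_subspace(acts[labels == c], rank)  # (d, r)
        projector = basis @ basis.T                        # (d, d)
        # ASSUMPTION: restrict the dense weights to the cluster's subspace.
        experts.append(w_dense @ projector)
    # Router rows are the centroids, so logits = x @ router_w.T.
    router_w = centroids.copy()
    return np.stack(experts), router_w, labels
```

Because each expert is derived from a different cluster, the experts no longer start from identical weights, and the router starts from the cluster centroids instead of random values.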

📝 Abstract

Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model's input activations into semantic clusters. Each expert is then initialized using the subspace representation of its corresponding cluster, obtained via truncated SVD, and the router's initial weights are set to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance from an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior.
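The abstract does not give the form of the expert-ensemble self-distillation loss, so the following is only one plausible reading, sketched in NumPy. It assumes the teacher is the gate-weighted mixture over all experts, the student is the sparse top-k mixture, and the distance is mean squared error; the function name `ensemble_distill_loss` and every shape convention are hypothetical. In an actual training framework the teacher output would typically be treated as a constant (no gradient), which a forward-only NumPy sketch cannot show.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_distill_loss(x, experts, router_w, top_k=1):
    """x: (n, d); experts: (E, h, d); router_w: (E, d). Returns a scalar MSE."""
    # Output of every expert for every token: (n, E, h).
    all_out = np.einsum("nd,ehd->neh", x, experts)
    gates = softmax(x @ router_w.T)                      # (n, E)
    # ASSUMPTION: the ensemble teacher is the dense, gate-weighted mixture.
    teacher = np.einsum("ne,neh->nh", gates, all_out)    # (n, h)
    # Student: keep only the top-k gates per token and renormalize them.
    top_idx = np.argsort(-gates, axis=1)[:, :top_k]      # (n, k)
    mask = np.zeros_like(gates)
    np.put_along_axis(mask, top_idx, 1.0, axis=1)
    sparse = gates * mask
    sparse = sparse / sparse.sum(axis=1, keepdims=True)
    student = np.einsum("ne,neh->nh", sparse, all_out)   # (n, h)
    # ASSUMPTION: mean squared error as the distillation distance.
    return float(np.mean((student - teacher) ** 2))
```

Under these assumptions the loss vanishes when `top_k` equals the number of experts, since the student then coincides with the teacher, and it grows as the sparse routing decision diverges from the full ensemble.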

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

expert symmetry

early specialization

MoE initialization

sparse upcycling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cluster-aware Upcycling

Mixture-of-Experts

Expert Specialization