🤖 AI Summary
This work addresses expert symmetry and limited early specialization in sparse upcycling by proposing a cluster-aware initialization method. It partitions the pretrained dense model's input activations into semantic clusters, initializes each expert from the subspace representation of its cluster obtained via truncated singular value decomposition (SVD), and sets the router's initial weights to the cluster centroids. An expert-ensemble self-distillation loss further stabilizes training by providing routing guidance from an ensemble teacher. Evaluated on CLIP ViT-B/32 and ViT-B/16, the method consistently outperforms existing approaches on zero-shot and few-shot benchmarks, while yielding more diverse expert representations and more confident routing.
📝 Abstract
Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model's input activations into semantic clusters. Each expert is then initialized using the subspace representation of its corresponding cluster, obtained via truncated SVD, and the router's initial weights are set to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance from an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior.
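
The initialization steps described in the abstract (cluster the activations, take a truncated SVD per cluster, set the router to the centroids) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how the subspace is turned into expert weights, so the rule used here (projecting the dense weight matrix onto each cluster's top-`rank` right singular directions) is an assumption. The function names `kmeans` and `cluster_aware_init`, the plain Lloyd's k-means, and the uncentered SVD are likewise hypothetical choices made for illustration.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative choice). Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]

    def assign(c):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)

    for _ in range(iters):
        labels = assign(centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign(centroids)


def cluster_aware_init(w_dense, acts, num_experts, rank, seed=0):
    """Hypothetical cluster-aware initialization.

    w_dense: (hidden, d) dense layer weight, applied as y = w_dense @ x.
    acts:    (n, d) input activations collected from the dense model.
    Returns (experts, router_w) with shapes (E, hidden, d) and (E, d).
    """
    centroids, labels = kmeans(acts, num_experts, seed=seed)
    experts = []
    for j in range(num_experts):
        members = acts[labels == j]
        if len(members) == 0:
            # Fallback for an empty cluster: plain copy of the dense weights.
            experts.append(w_dense.copy())
            continue
        # Truncated SVD: keep the top-`rank` right singular vectors, which
        # span the dominant subspace of this cluster's activations.
        _, _, vt = np.linalg.svd(members, full_matrices=False)
        basis = vt[:rank]                    # (<= rank, d)
        projector = basis.T @ basis          # (d, d) orthogonal projector
        # ASSUMPTION: the expert is the dense weight restricted to the subspace.
        experts.append(w_dense @ projector)
    # Router rows are the centroids, so logits = x @ router_w.T.
    router_w = centroids.copy()
    return np.stack(experts), router_w
```

Under these assumptions, every expert starts from a different low-rank view of the same dense weights, so the experts are no longer identical at step zero, and the router's logit for expert `j` is the dot product between the input and centroid `j`.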