🤖 AI Summary
This work addresses the challenge of enabling large language models (LLMs) to automatically identify and collaboratively leverage domain-specific expert capabilities during multi-domain instruction tuning. It proposes an end-to-end dense-to-sparse Mixture-of-Experts (MoE) architecture-transformation framework: during instruction tuning, a learnable routing network autonomously discovers multiple structured sparse experts, without requiring human annotations or domain priors, while a sparse interpolation mechanism enables efficient knowledge transfer and dynamic expert fusion. The method achieves both parameter sparsity and high representational capacity. It attains state-of-the-art performance on major instruction-tuning benchmarks, significantly outperforming existing dense fine-tuning and MoE-based approaches, while delivering a superior trade-off between model performance and computational cost.
📝 Abstract
We present Sparse Interpolated Mixture-of-Experts (SIMoE) instruction-tuning, an end-to-end algorithm that fine-tunes a dense pre-trained Large Language Model (LLM) into an MoE-style model with capabilities in multiple specialized domains. During instruction tuning, SIMoE automatically identifies multiple specialized experts under a specified sparsity constraint; each expert is a structurally sparse subset of the seed LLM's parameters that corresponds to domain-specific knowledge within the data. SIMoE simultaneously learns an input-dependent expert-merging strategy via a router network, leveraging rich cross-expert knowledge for superior downstream generalization. Empirically, SIMoE consistently achieves state-of-the-art performance on common instruction-tuning benchmarks while maintaining an optimal performance-compute trade-off compared to all baselines.
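The abstract's core mechanism can be illustrated with a minimal sketch: each expert is a structurally sparse delta on a shared seed weight matrix, and a router produces input-dependent softmax weights that merge those deltas before the layer is applied. This is a toy NumPy illustration of the general idea, not the paper's implementation; the linear router, the row-wise sparsity pattern, and all shapes and names (`W_base`, `deltas`, `W_router`) are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, n_experts = 8, 4, 3
sparsity = 0.75  # fraction of base parameters each expert leaves untouched (assumed)

# Seed (dense, pre-trained) weight matrix: stands in for one layer of the seed LLM.
W_base = rng.normal(size=(d_in, d_out))

# Each expert is a structurally sparse delta on the seed weights:
# a binary row mask picks which structures (here: input rows) the expert may update.
masks = np.zeros((n_experts, d_in, 1))
for e in range(n_experts):
    keep = rng.choice(d_in, size=int(d_in * (1 - sparsity)), replace=False)
    masks[e, keep, 0] = 1.0
deltas = rng.normal(scale=0.1, size=(n_experts, d_in, d_out)) * masks

# Hypothetical linear router: maps an input to softmax weights over experts.
W_router = rng.normal(scale=0.1, size=(d_in, n_experts))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    """Merge sparse expert deltas per input, then apply the merged layer."""
    w = softmax(x @ W_router)                     # (batch, n_experts)
    merged = np.einsum("be,eio->bio", w, deltas)  # per-input interpolated delta
    W = W_base[None] + merged                     # (batch, d_in, d_out)
    return np.einsum("bi,bio->bo", x, W)

x = rng.normal(size=(2, d_in))
y = forward(x)
print(y.shape)  # (2, 4)
```

Because the merged layer is a weighted sum of sparse deltas around a shared dense seed, the model keeps per-expert parameter sparsity while the interpolation lets each input draw on several experts at once.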