🤖 AI Summary
This work addresses the challenge of enabling large language models (LLMs) to automatically identify and collaboratively leverage domain-specific expert capabilities during multi-domain instruction tuning. It proposes an end-to-end dense-to-sparse Mixture-of-Experts (MoE) architecture-transformation framework: during instruction tuning, a learnable routing network autonomously discovers multiple structured sparse experts, without requiring human annotations or domain priors, while a sparse interpolation mechanism enables efficient knowledge transfer and dynamic expert fusion. The method achieves both parameter sparsity and high representational capacity. It attains state-of-the-art performance on major instruction-tuning benchmarks, significantly outperforming existing dense fine-tuning and MoE-based approaches, while delivering a superior trade-off between model performance and computational cost.
📝 Abstract
We present Sparse Interpolated Mixture-of-Experts (SIMoE) instruction-tuning, an end-to-end algorithm that fine-tunes a dense pre-trained Large Language Model (LLM) into an MoE-style model with capabilities in multiple specialized domains. During instruction tuning, SIMoE automatically identifies multiple specialized experts under a specified sparsity constraint; each expert is a structurally sparse subset of the seed LLM's parameters that corresponds to domain-specific knowledge within the data. SIMoE simultaneously learns an input-dependent expert-merging strategy via a router network, leveraging rich cross-expert knowledge for superior downstream generalization. Empirically, SIMoE consistently achieves state-of-the-art performance on common instruction-tuning benchmarks while maintaining an optimal performance-compute trade-off compared to all baselines.
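The abstract's core mechanism can be illustrated with a minimal sketch: each expert is a structurally sparse delta on a shared seed weight matrix, and a router produces input-dependent softmax weights that merge those deltas before the layer is applied. This is a toy NumPy illustration of the general idea, not the paper's implementation; the linear router, the row-wise sparsity pattern, and all shapes and names (`W_base`, `deltas`, `W_router`) are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, n_experts = 8, 4, 3
sparsity = 0.75  # fraction of base parameters each expert leaves untouched (assumed)

# Seed (dense, pre-trained) weight matrix: stands in for one layer of the seed LLM.
W_base = rng.normal(size=(d_in, d_out))

# Each expert is a structurally sparse delta on the seed weights:
# a binary row mask picks which structures (here: input rows) the expert may update.
masks = np.zeros((n_experts, d_in, 1))
for e in range(n_experts):
    keep = rng.choice(d_in, size=int(d_in * (1 - sparsity)), replace=False)
    masks[e, keep, 0] = 1.0
deltas = rng.normal(scale=0.1, size=(n_experts, d_in, d_out)) * masks

# Hypothetical linear router: maps an input to softmax weights over experts.
W_router = rng.normal(scale=0.1, size=(d_in, n_experts))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    """Merge sparse expert deltas per input, then apply the merged layer."""
    w = softmax(x @ W_router)                     # (batch, n_experts)
    merged = np.einsum("be,eio->bio", w, deltas)  # per-input interpolated delta
    W = W_base[None] + merged                     # (batch, d_in, d_out)
    return np.einsum("bi,bio->bo", x, W)

x = rng.normal(size=(2, d_in))
y = forward(x)
print(y.shape)  # (2, 4)
```

Because the merged layer is a weighted sum of sparse deltas around a shared dense seed, the model keeps per-expert parameter sparsity while the interpolation lets each input draw on several experts at once.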