🤖 AI Summary
Traditional Mixture-of-Experts (MoE) architectures employ homogeneous experts and fixed activation patterns, which limits their ability to adapt to variation in input complexity and constrains computational efficiency. This work proposes Grove MoE: inspired by the big.LITTLE CPU paradigm, it introduces a heterogeneous expert structure comprising large “main” experts and lightweight “auxiliary” experts, coupled with dynamic gating and parameter reuse mechanisms that enable on-demand expert activation: large experts are invoked for complex inputs, while only the small experts are activated for simple ones. Built on Qwen3-30B-A3B-Base via upcycling during mid-training and post-training, the resulting 33B-parameter model activates only 3.14–3.28B parameters per token, matching or surpassing larger open-source MoE models while significantly improving inference efficiency and resource utilization. The core contribution is integrating elastic capacity scaling with fine-grained dynamic expert activation into the MoE paradigm for the first time.
📝 Abstract
The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architectures use homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14–3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
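The abstract does not spell out how the heterogeneous experts are wired together, so the PyTorch sketch below is only a rough, hypothetical illustration of the general idea: small routed experts are grouped, and a larger shared expert for a group fires only when one of that group's small experts is selected, so the per-token activated parameter count varies with the routing. All names, sizes, the top-k routing rule, and the grouping scheme (HeterogeneousMoE, FFNExpert, group_size, top_k) are assumptions made for illustration, not details taken from the GroveMoE release.

```python
# Rough, hypothetical sketch of a big.LITTLE-style heterogeneous MoE layer.
# Sizes, the top-k routing rule, and the "one large expert per group of small
# experts" scheme are illustrative assumptions, not the released GroveMoE design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFNExpert(nn.Module):
    """A plain two-layer feed-forward expert (up projection, SiLU, down projection)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class HeterogeneousMoE(nn.Module):
    """Small routed experts plus one larger shared expert per group.

    A group's large expert runs only when at least one of its small experts
    is selected for a token, so the activated parameter count varies per token.
    """

    def __init__(self, d_model: int = 512, n_small: int = 8,
                 group_size: int = 4, top_k: int = 2):
        super().__init__()
        assert n_small % group_size == 0
        self.top_k = top_k
        self.group_size = group_size
        self.gate = nn.Linear(d_model, n_small, bias=False)
        self.small = nn.ModuleList(
            FFNExpert(d_model, d_model) for _ in range(n_small))
        self.large = nn.ModuleList(
            FFNExpert(d_model, 4 * d_model) for _ in range(n_small // group_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        scores = self.gate(x)                             # [tokens, n_small]
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):        # per-token loop, kept simple for clarity
            active_groups = set()
            for w, e in zip(weights[t], idx[t]):
                e = int(e)
                out[t] += w * self.small[e](x[t])         # lightweight expert
                active_groups.add(e // self.group_size)
            for g in active_groups:                       # heavy expert fires at
                out[t] += self.large[g](x[t])             # most once per group
        return out


if __name__ == "__main__":
    layer = HeterogeneousMoE()
    tokens = torch.randn(4, 512)
    print(layer(tokens).shape)  # torch.Size([4, 512])
```

Under these assumptions, a token whose selected experts all fall in one group activates a single large expert, while a token whose selections span several groups activates several; this is the kind of complexity-dependent activated parameter count the abstract describes (3.14–3.28B activated parameters in the actual model).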