Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional Mixture-of-Experts (MoE) architectures employ homogeneous experts and fixed activation patterns, which limits adaptability to variations in input complexity and caps computational efficiency. This work proposes Grove MoE: inspired by the heterogeneous big.LITTLE CPU paradigm, it introduces an expert structure comprising large "main" experts and lightweight "adjugate" auxiliary experts, coupled with dynamic gating and parameter-reuse mechanisms that enable on-demand expert activation: large experts are invoked for complex tokens, while only small experts fire for simple ones. Built by upcycling Qwen3-30B-A3B-Base during mid-training and post-training, the resulting 33B-parameter model activates only 3.14–3.28B parameters per token, matching or surpassing the performance of similar or larger open-source MoE models while improving inference efficiency and resource utilization. The central contribution is integrating elastic capacity scaling with fine-grained, dynamic expert activation into the MoE paradigm.

📝 Abstract
The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architecture uses homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
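To make the heterogeneous-expert idea concrete, the following is a minimal toy sketch of a "big.LITTLE"-style MoE layer in NumPy: lightweight experts always process the token, and a large expert is woken only when the gate's confidence is low, which serves here as a stand-in for input complexity. All shapes, the threshold rule, and the gate itself are illustrative assumptions, not the paper's exact adjugate-expert mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

def make_expert(d_hidden):
    """A tiny 2-layer MLP expert; d_hidden controls its parameter count."""
    w1 = rng.standard_normal((d_model, d_hidden)) * 0.1
    w2 = rng.standard_normal((d_hidden, d_model)) * 0.1
    return w1, w2

def expert_forward(x, expert):
    w1, w2 = expert
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU MLP

def param_count(expert):
    w1, w2 = expert
    return w1.size + w2.size

# Two lightweight "auxiliary" experts and one large "main" expert.
small_experts = [make_expert(4), make_expert(4)]
big_expert = make_expert(32)

gate_w = rng.standard_normal((d_model, len(small_experts))) * 0.1

def moe_forward(x, big_threshold=0.6):
    """Route one token. The large expert fires only when the gate is
    uncertain about the small experts (a toy proxy for 'complexity')."""
    logits = x @ gate_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    out = sum(p * expert_forward(x, e) for p, e in zip(probs, small_experts))
    active = sum(param_count(e) for e in small_experts)
    if probs.max() < big_threshold:  # "complex" token: also run the big expert
        out = out + expert_forward(x, big_expert)
        active += param_count(big_expert)
    return out, active

x = rng.standard_normal(d_model)
y, active_params = moe_forward(x)
print(active_params)  # activated parameter count varies per token
```

The per-token `active` count is the toy analogue of GroveMoE's 3.14–3.28B dynamic activation range: simple tokens touch only the small experts, complex tokens additionally pay for the large one.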
Problem

Research questions and friction points this paper is trying to address.

Traditional MoE uses homogeneous, uniformly sized experts, activating a fixed number of parameters regardless of input complexity
Grove MoE introduces experts of varying sizes to better match compute to input difficulty
A dynamic activation mechanism adjusts the activated parameter count per token
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous experts with varying sizes
Dynamic activation mechanism for efficiency
Upcycling strategy applied to Qwen3-30B-A3B-Base for model development
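The upcycling idea above can be sketched as follows: rather than training new experts from scratch, each expert is initialized from a pretrained checkpoint's FFN weights (with a little noise so copies can specialize during continued training). The shapes, noise scale, and dict layout are hypothetical; this is not the exact GroveMoE procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend this is one FFN block from a pretrained dense checkpoint.
pretrained_ffn = {
    "w1": rng.standard_normal((8, 16)),
    "w2": rng.standard_normal((16, 8)),
}

def upcycle(ffn, n_experts, noise=1e-3):
    """Replicate a pretrained FFN into n_experts near-identical copies;
    the small noise lets experts diverge and specialize in later training."""
    return [
        {k: v + noise * rng.standard_normal(v.shape) for k, v in ffn.items()}
        for _ in range(n_experts)
    ]

experts = upcycle(pretrained_ffn, n_experts=4)
print(len(experts), experts[0]["w1"].shape)
```

The payoff is that the upcycled MoE starts from a working model's loss basin, so mid-training and post-training only have to teach the routing and expert specialization, not language modeling from scratch.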