MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

📅 2025-08-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Deploying large MoE models is constrained by their memory footprint, and existing parameter compression techniques incur substantial accuracy degradation (a 7–14% relative drop) even at modest compression rates. To address this, the paper proposes a low-distortion compression method: each expert's up- and gate-projection matrix is factorized as W = AB, where A is specific to the expert and the larger factor B is expressed as a linear combination of basis matrices shared across all experts in the same MoE layer. The factors are learned by minimizing the reconstruction error against the original weights. This design preserves expert specialization while reducing parameter redundancy. Experiments on Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B), and Kimi-K2-Instruct (1T) show a 24–30% reduction in parameter count with only a 1–2% absolute accuracy drop (about 2% in relative terms), a notably smaller drop than prior compression approaches.

📝 Abstract

The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs like DeepSeek-V3-0324 and Kimi-K2-Instruct present serious challenges due to substantial memory requirements in deployment. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7-14% in relative terms) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression while incurring minimal accuracy drops. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further re-parameterized as a linear combination of basis matrices {B_i} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops compared to prior works. For instance, MoBE can reduce the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B) and Kimi-K2-Instruct (1T) by 24%-30% with only a 1%-2% accuracy drop (about 2% when measured relatively).
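The parameterization described in the abstract can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the toy sizes, the names `A`, `bases`, `alpha`, and `reconstruct`, and the random values. It only shows how W = AB with B built from shared bases is assembled and how the parameter count compares with storing dense expert matrices; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, not taken from the paper.
num_experts, d_out, d_in = 8, 16, 32
rank, num_bases = 8, 2

# A[e]: factor unique to expert e (d_out x rank).
A = rng.standard_normal((num_experts, d_out, rank))
# bases[j]: basis matrix shared by all experts in the layer (rank x d_in).
bases = rng.standard_normal((num_bases, rank, d_in))
# alpha[e, j]: coefficient of basis j in expert e's linear combination.
alpha = rng.standard_normal((num_experts, num_bases))


def reconstruct(A, alpha, bases):
    """Return W_hat[e] = A[e] @ (sum_j alpha[e, j] * bases[j]) for every expert."""
    B = np.einsum("ej,jrd->erd", alpha, bases)  # per-expert B, shape (E, rank, d_in)
    return A @ B                                # batched matmul, shape (E, d_out, d_in)


W_hat = reconstruct(A, alpha, bases)

# Parameters needed for dense expert matrices versus the factorized form.
dense_params = num_experts * d_out * d_in
mobe_params = A.size + bases.size + alpha.size
print(dense_params, mobe_params, mobe_params / dense_params)
```

With these toy sizes the shared bases are stored once per layer, so the cost of B is amortized over all experts and only A and a handful of coefficients grow with the number of experts.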

Problem

Research questions and friction points this paper is trying to address.

Compress MoE-based LLMs to reduce memory requirements

Minimize accuracy drops during model compression

Share basis matrices across experts for efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

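Decomposes each expert's up/gate matrix via a rank decomposition W = AB

Reparameterizes the larger matrix B as a linear combination of shared basis matrices

Learns the factorization by minimizing reconstruction error against the original weights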
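To make the reconstruction objective concrete, here is a minimal sketch that fits the three groups of parameters to a stack of toy weight matrices by minimizing the squared reconstruction error. The optimizer is an assumption: the summary does not say how the paper minimizes the error, so this sketch swaps in alternating least squares, where each group is solved exactly while the others are held fixed. The function names, sizes, and random data are all hypothetical.

```python
import numpy as np


def reconstruction_loss(W, A, alpha, bases):
    """Sum over experts of ||W[e] - A[e] @ sum_j alpha[e, j] * bases[j]||_F^2."""
    B = np.einsum("ej,jrd->erd", alpha, bases)
    return float(np.sum((W - A @ B) ** 2))


def fit_als(W, rank, num_bases, sweeps=10, seed=0):
    """Alternating least squares (an assumed optimizer, not the paper's)."""
    rng = np.random.default_rng(seed)
    E, d_out, d_in = W.shape
    A = rng.standard_normal((E, d_out, rank))
    bases = rng.standard_normal((num_bases, rank, d_in))
    alpha = rng.standard_normal((E, num_bases))
    losses = [reconstruction_loss(W, A, alpha, bases)]
    for _ in range(sweeps):
        # 1) Expert factors: solve B_e^T A_e^T ~= W_e^T for each expert.
        B = np.einsum("ej,jrd->erd", alpha, bases)
        for e in range(E):
            A[e] = np.linalg.lstsq(B[e].T, W[e].T, rcond=None)[0].T
        # 2) Shared bases: the row block for expert e is
        #    [alpha[e,0]*A_e, ..., alpha[e,m-1]*A_e], acting on the stacked bases.
        G = np.concatenate(
            [np.kron(alpha[e][None, :], A[e]) for e in range(E)], axis=0
        )
        stacked = np.linalg.lstsq(G, W.reshape(E * d_out, d_in), rcond=None)[0]
        bases = stacked.reshape(num_bases, rank, d_in)
        # 3) Mixing coefficients: regress vec(W_e) on the columns vec(A_e @ bases[j]).
        for e in range(E):
            M = np.stack(
                [(A[e] @ bases[j]).ravel() for j in range(num_bases)], axis=1
            )
            alpha[e] = np.linalg.lstsq(M, W[e].ravel(), rcond=None)[0]
        losses.append(reconstruction_loss(W, A, alpha, bases))
    return A, alpha, bases, losses


# Toy "original" expert weights; real expert matrices are far larger.
W = np.random.default_rng(1).standard_normal((4, 8, 12))
A, alpha, bases, losses = fit_als(W, rank=4, num_bases=2, sweeps=10)
print(losses[0], losses[-1])
```

Because every sub-step is an exact least-squares solve, the loss cannot increase from one sweep to the next beyond floating-point noise. A gradient-based optimizer would be the more likely choice at the scale of real MoE layers, where the stacked system in step 2 becomes too large to solve directly.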

🔎 Similar Papers

Position IDs Matter: An Enhanced Position Layout for Efficient Context Compression in Large Language Models