Retraining-Free Merging of Sparse MoE via Hierarchical Clustering

📅 2024-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the difficulty of deploying Sparse Mixture-of-Experts (SMoE) models on memory-constrained devices, where redundant expert parameters inflate GPU memory usage, this paper proposes HC-SMoE, a hierarchical-clustering-based expert merging framework that requires no retraining. Its core innovation is a bottom-up hierarchical clustering strategy driven by expert output similarity, which removes the dependence on routing decisions and enables task-agnostic, robust merging of functionally similar experts. By combining expert weight reconstruction with a theoretically grounded function-consistency analysis, HC-SMoE preserves the behavior of the compressed model. Evaluated on state-of-the-art models including Qwen and Mixtral, HC-SMoE compresses expert parameters by up to 50%, incurs less than 0.5% zero-shot performance degradation, reduces inference GPU memory consumption by 35%, and imposes zero retraining overhead.
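The sketch below illustrates the output-based, bottom-up (agglomerative) clustering idea in general terms; it is not the paper's exact procedure. Each expert is represented by its outputs on a shared calibration batch, experts are clustered with average linkage, and the weights within each cluster are uniformly averaged. The function name, tensor shapes, linkage choice, and uniform averaging are illustrative assumptions.

```python
# Sketch of output-similarity expert merging (illustrative; hypothetical
# names and shapes, uniform averaging stands in for the paper's merging rule).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def merge_experts_by_output(expert_outputs, expert_weights, num_clusters):
    """expert_outputs: list of E arrays, each (N, d), outputs on one shared
    calibration batch; expert_weights: list of E weight arrays to merge."""
    E = len(expert_outputs)
    feats = np.stack([o.reshape(-1) for o in expert_outputs])  # (E, N*d)
    # Bottom-up agglomerative clustering on expert-output vectors.
    Z = linkage(feats, method="average", metric="euclidean")
    labels = fcluster(Z, t=num_clusters, criterion="maxclust")  # labels in 1..k
    merged = {c: sum(expert_weights[e] for e in range(E) if labels[e] == c)
                 / int((labels == c).sum())
              for c in np.unique(labels)}
    return labels, merged

# Toy usage: 8 experts reduced to 4 merged experts.
rng = np.random.default_rng(0)
outs = [rng.normal(size=(16, 8)) for _ in range(8)]    # calibration outputs
weights = [rng.normal(size=(8, 8)) for _ in range(8)]  # toy expert weights
labels, merged = merge_experts_by_output(outs, weights, num_clusters=4)
print(labels, len(merged))
```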

📝 Abstract
Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts for large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE's superior performance and practical applicability for real-world deployments.
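As a back-of-envelope check on the "up to 50% expert parameter compression" figure, the snippet below counts expert parameters for a Mixtral-8x7B-like configuration when 8 experts per layer are merged down to 4. The dimensions are the publicly documented Mixtral ones, assumed here for illustration and not taken from this page.

```python
# Rough expert-parameter count for a Mixtral-8x7B-like SMoE (assumed dims).
hidden, ffn, layers, experts = 4096, 14336, 32, 8
per_expert = 3 * hidden * ffn                 # gate, up, and down projections
before = experts * layers * per_expert        # all expert parameters
after = (experts // 2) * layers * per_expert  # merge 8 -> 4 experts per layer
print(f"{before/1e9:.1f}B -> {after/1e9:.1f}B "
      f"({100 * (1 - after / before):.0f}% expert-parameter reduction)")
# ~45.1B -> ~22.5B (50% reduction)
```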
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
SMoE Model
Limited Device Memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

HC-SMoE
Hierarchical Clustering
Memory Optimization
👥 Authors
I-Chun Chen (National Tsing Hua University)
Hsu-Shen Liu (National Tsing Hua University)
Wei-Fang Sun (NVIDIA AI Technology Center, NV AITC)
Chen-Hao Chao (National Tsing Hua University)
Yen-Chang Hsu (TSMC). Interests: Language modeling, Artificial intelligence, Computer vision, On-device AI.
Chun-Yi Lee (Department of Computer Science and Information Engineering, National Taiwan University). Interests: Intelligent Robotics, Deep Reinforcement Learning, Computer Vision, Virtual-to-Real Transfer.