Retraining-Free Merging of Sparse MoE via Hierarchical Clustering

📅 2024-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the difficulty of deploying Sparse Mixture-of-Experts (SMoE) models on memory-constrained devices, where redundant expert parameters inflate GPU memory usage, this paper proposes HC-SMoE, a hierarchical-clustering-based expert merging framework that requires no retraining. Its core innovation is a bottom-up hierarchical clustering strategy driven by expert output similarity, which removes the dependence on routing decisions and enables task-agnostic, robust merging of functionally similar experts. By combining expert weight reconstruction with a theoretically grounded function-consistency analysis, HC-SMoE preserves the behavior of the compressed model. Evaluated on state-of-the-art models including Qwen and Mixtral, HC-SMoE compresses expert parameters by up to 50%, incurs less than 0.5% zero-shot performance degradation, reduces inference GPU memory consumption by 35%, and imposes zero retraining overhead.
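The sketch below illustrates the output-based, bottom-up (agglomerative) clustering idea in general terms; it is not the paper's exact procedure. Each expert is represented by its outputs on a shared calibration batch, experts are clustered with average linkage, and the weights within each cluster are uniformly averaged. The function name, tensor shapes, linkage choice, and uniform averaging are illustrative assumptions.

```python
# Sketch of output-similarity expert merging (illustrative; hypothetical
# names and shapes, uniform averaging stands in for the paper's merging rule).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def merge_experts_by_output(expert_outputs, expert_weights, num_clusters):
    """expert_outputs: list of E arrays, each (N, d), outputs on one shared
    calibration batch; expert_weights: list of E weight arrays to merge."""
    E = len(expert_outputs)
    feats = np.stack([o.reshape(-1) for o in expert_outputs])  # (E, N*d)
    # Bottom-up agglomerative clustering on expert-output vectors.
    Z = linkage(feats, method="average", metric="euclidean")
    labels = fcluster(Z, t=num_clusters, criterion="maxclust")  # labels in 1..k
    merged = {c: sum(expert_weights[e] for e in range(E) if labels[e] == c)
                 / int((labels == c).sum())
              for c in np.unique(labels)}
    return labels, merged

# Toy usage: 8 experts reduced to 4 merged experts.
rng = np.random.default_rng(0)
outs = [rng.normal(size=(16, 8)) for _ in range(8)]    # calibration outputs
weights = [rng.normal(size=(8, 8)) for _ in range(8)]  # toy expert weights
labels, merged = merge_experts_by_output(outs, weights, num_clusters=4)
print(labels, len(merged))
```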

📝 Abstract
Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts for large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE's superior performance and practical applicability for real-world deployments.
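As a back-of-envelope check on the "up to 50% expert parameter compression" figure, the snippet below counts expert parameters for a Mixtral-8x7B-like configuration when 8 experts per layer are merged down to 4. The dimensions are the publicly documented Mixtral ones, assumed here for illustration and not taken from this page.

```python
# Rough expert-parameter count for a Mixtral-8x7B-like SMoE (assumed dims).
hidden, ffn, layers, experts = 4096, 14336, 32, 8
per_expert = 3 * hidden * ffn                 # gate, up, and down projections
before = experts * layers * per_expert        # all expert parameters
after = (experts // 2) * layers * per_expert  # merge 8 -> 4 experts per layer
print(f"{before/1e9:.1f}B -> {after/1e9:.1f}B "
      f"({100 * (1 - after / before):.0f}% expert-parameter reduction)")
# ~45.1B -> ~22.5B (50% reduction)
```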
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
SMoE Model
Limited Device Memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

HC-SMoE
Hierarchical Clustering
Memory Optimization
👥 Authors
I-Chun Chen (National Tsing Hua University)
Hsu-Shen Liu (National Tsing Hua University)
Wei-Fang Sun (NVIDIA AI Technology Center, NV AITC)
Chen-Hao Chao (National Tsing Hua University)
Yen-Chang Hsu (TSMC). Interests: Language modeling, Artificial intelligence, Computer vision, On-device AI.
Chun-Yi Lee (Department of Computer Science and Information Engineering, National Taiwan University). Interests: Intelligent Robotics, Deep Reinforcement Learning, Computer Vision, Virtual-to-Real Transfer.