Adaptive and Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing LoRA-MoE approaches employ a uniform expert configuration across heterogeneous Transformer modules, disregarding their functional differences and varying capacity requirements. This leads to parameter redundancy, high optimizer overhead, and constrained expert specialization due to enforced load balancing. To address these limitations, this work proposes DMEP, a framework that introduces dynamic expert pruning at the module level for the first time. By monitoring expert utilization, DMEP physically prunes inefficient experts on a per-module basis, yielding compact architectures tailored to each module’s characteristics. Following pruning, load balancing constraints are relaxed, enabling remaining experts to specialize effectively for downstream tasks. Experiments demonstrate that DMEP reduces trainable parameters by 35%–43% and improves training throughput by approximately 10% across multiple reasoning benchmarks, while matching or exceeding the accuracy of baseline LoRA-MoE models.

📝 Abstract

LoRA-MoE has emerged as an effective paradigm for parameter-efficient fine-tuning, combining the low training cost of LoRA with the increased adaptation capacity of Mixture-of-Experts (MoE). However, existing LoRA-MoE frameworks typically adopt a fixed and uniform expert configuration across heterogeneous Transformer modules (e.g., attention query/key projections and MLP gating networks), ignoring their distinct functional roles and capacity requirements. This design leads to localized over-provisioning, redundant trainable parameters, and unnecessary optimizer-state overhead. Moreover, prior methods enforce load balancing among experts throughout training. Although beneficial in the early stage, this constraint becomes restrictive once routing patterns stabilize, limiting expert specialization on downstream tasks. In this paper, we propose DMEP, a novel LoRA-MoE fine-tuning framework based on Dynamic Module-wise Expert Pruning. DMEP tracks expert utilization during training and physically removes low-utility experts on a per-module basis, yielding a more compact expert structure tailored to different modules. The pruned model then continues training without the load-balancing constraint, freeing the remaining experts to focus entirely on the downstream task and develop specialized expertise. By jointly adapting module-wise expert capacity and eliminating unnecessary balancing, DMEP improves both parameter efficiency and training efficiency. Extensive experiments on multiple reasoning benchmarks show that DMEP reduces trainable parameters by 35%–43% and improves training throughput by about 10%, while maintaining or surpassing the downstream reasoning accuracy of uniform LoRA-MoE baselines.
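The track-prune-relax loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the pruning criterion, so the rule used here (drop experts whose routing share falls below a fraction of the uniform share) is an assumption, and every name (`ModuleExperts`, `prune_module`, `aux_loss_weight`, `rel_threshold`, `base_weight`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ModuleExperts:
    """Routing statistics for the LoRA experts attached to one Transformer module."""
    name: str
    expert_ids: list
    token_counts: dict = field(default_factory=dict)

    def record(self, routed_expert_ids):
        # Accumulate how many tokens the router sent to each expert.
        for e in routed_expert_ids:
            self.token_counts[e] = self.token_counts.get(e, 0) + 1

    def utilization(self):
        # Fraction of routed tokens handled by each remaining expert.
        total = sum(self.token_counts.get(e, 0) for e in self.expert_ids)
        if total == 0:
            return {e: 0.0 for e in self.expert_ids}
        return {e: self.token_counts.get(e, 0) / total for e in self.expert_ids}


def prune_module(module, rel_threshold=0.5, min_experts=1):
    """Keep experts whose share is at least rel_threshold * uniform share.

    ASSUMPTION: this threshold rule is illustrative only; the source does not
    specify how DMEP decides that an expert is low-utility.
    """
    util = module.utilization()
    uniform_share = 1.0 / len(module.expert_ids)
    keep = [e for e in module.expert_ids if util[e] >= rel_threshold * uniform_share]
    if len(keep) < min_experts:
        # Never empty a module: fall back to the most-used experts.
        ranked = sorted(module.expert_ids, key=lambda e: util[e], reverse=True)
        keep = sorted(ranked[:min_experts])
    # "Physical" removal: forget the pruned experts and their statistics.
    module.expert_ids = keep
    module.token_counts = {e: c for e, c in module.token_counts.items() if e in keep}
    return keep


def aux_loss_weight(pruned, base_weight=0.01):
    # After pruning, the load-balancing term is switched off entirely.
    # ASSUMPTION: base_weight is a placeholder value, not taken from the paper.
    return 0.0 if pruned else base_weight


# Each module is pruned independently, so modules can end up with different expert counts.
q_proj = ModuleExperts("attn.q_proj", [0, 1, 2, 3])
q_proj.record([0, 0, 0, 0, 0, 1, 1, 1, 2, 0])  # expert 3 is never selected
kept = prune_module(q_proj)
print(q_proj.name, "keeps experts", kept)
```

In a real training loop the removed experts' LoRA weights and optimizer state would also be freed, which is where the reported savings in trainable parameters and optimizer overhead would come from; the sketch only tracks which expert indices survive.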

Problem

Research questions and friction points this paper is trying to address.

LoRA-MoE

expert pruning

module-wise adaptation

load balancing

parameter efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA-MoE

expert pruning

module-wise adaptation