LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MoE-PEFT integration methods predominantly rely on heuristic Top-K routing, which requires manual hyperparameter tuning and enforces a fixed number of experts per token. This work proposes LD-MoLE, the first LoRA-MoE framework featuring jointly token- and layer-wise adaptive routing. It employs a differentiable routing function with a closed-form solution to dynamically determine the number of experts activated per token at each layer. Additionally, it introduces an analytical sparsity control objective that explicitly regularizes expert activation magnitude. Experiments on Qwen3-1.7B and Llama-3.2-3B demonstrate that LD-MoLE achieves state-of-the-art average performance across multiple benchmarks. Crucially, it improves routing flexibility and training stability while remaining robust to hyperparameter choices, removing the need to manually tune routing thresholds or expert counts.

📝 Abstract
Recent studies have shown that combining parameter-efficient fine-tuning (PEFT) with mixture-of-experts (MoE) is an effective strategy for adapting large language models (LLMs) to downstream tasks. However, most existing approaches rely on conventional TopK routing, which requires careful hyperparameter tuning and assigns a fixed number of experts to each token. In this work, we propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts that enables adaptive, token-dependent, and layer-wise expert allocation. Our method replaces the non-differentiable TopK selection with a differentiable routing function that admits a closed-form solution. Moreover, our design allows the model to adaptively determine the number of experts to activate for each token at different layers. In addition, we introduce an analytical sparsity control objective to regularize the number of activated experts. Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores compared to state-of-the-art baselines across a diverse set of benchmarks. Our method not only achieves superior performance but also demonstrates the ability to learn token-dependent and layer-wise expert allocation.
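The abstract states that LD-MoLE replaces non-differentiable TopK selection with a differentiable routing function that has a closed-form solution, but the function itself is not given on this page. As a minimal sketch of what such a router can look like, sparsemax (a standard technique, not necessarily the one used in the paper) has both properties: it is the closed-form Euclidean projection of the router logits onto the probability simplex, and it can assign exactly zero weight to some experts, so the number of active experts varies per token.

```python
import numpy as np

def sparsemax(z):
    # Closed-form projection of router logits z onto the probability simplex.
    # Unlike softmax, the result can contain exact zeros, so each token
    # activates a variable (data-dependent) number of experts.
    z_sorted = np.sort(z)[::-1]           # logits in descending order
    k = np.arange(1, len(z) + 1)          # 1-based support-size candidates
    cssv = np.cumsum(z_sorted)            # cumulative sums of sorted logits
    support = k * z_sorted > cssv - 1.0   # which support sizes are feasible
    k_max = k[support][-1]                # largest feasible support size
    tau = (cssv[k_max - 1] - 1.0) / k_max # threshold shared by active experts
    return np.maximum(z - tau, 0.0)       # zero out experts below threshold

# Per-token routing over 3 hypothetical LoRA experts:
logits = np.array([[2.0, 1.0, -1.0],
                   [1.0, 0.9, -2.0]])
weights = np.stack([sparsemax(row) for row in logits])
active_per_token = (weights > 0).sum(axis=1)  # differs across tokens
```

A token's output would then be the gate-weighted sum of its active LoRA experts' outputs; inactive experts contribute nothing and need not be computed.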
Problem

Research questions and friction points this paper is trying to address.

Replaces fixed TopK routing with learnable dynamic expert allocation
Enables adaptive token-dependent expert selection across model layers
Introduces differentiable routing with analytical sparsity control mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable dynamic routing for LoRA experts
Differentiable routing with closed-form solution
Adaptive layer-wise expert allocation per token
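The analytical sparsity control objective itself is not reproduced on this page. One illustrative stand-in, assuming each token's gate weights are nonnegative and sum to one, penalizes the deviation of a differentiable "effective expert count" (the inverse participation ratio) from a target activation budget. The names `soft_expert_count`, `sparsity_penalty`, and `target_active` are hypothetical, not the paper's API.

```python
import numpy as np

def soft_expert_count(gates, eps=1e-9):
    # Inverse participation ratio: a differentiable proxy for how many
    # experts a token effectively uses. For gates summing to 1, it equals
    # k when mass is spread evenly over k experts, and 1 when a single
    # expert takes all the mass.
    return 1.0 / (np.square(gates).sum(axis=-1) + eps)

def sparsity_penalty(gates, target_active=2.0):
    # Hypothetical analytical regularizer: pull each token's effective
    # expert count toward a target budget, averaged over tokens.
    return np.mean((soft_expert_count(gates) - target_active) ** 2)

# gates: (tokens, experts); token 0 splits over two experts, token 1 uses one.
gates = np.array([[0.5, 0.5, 0.0],
                  [1.0, 0.0, 0.0]])
penalty = sparsity_penalty(gates, target_active=2.0)
```

Because the proxy is smooth in the gate values, such a penalty can be minimized jointly with the task loss by gradient descent, unlike a hard count of nonzero experts.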