LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MoE-PEFT integration methods predominantly rely on heuristic Top-K routing, which requires manual hyperparameter tuning and enforces a fixed number of experts per token. This work proposes LD-MoLE, the first LoRA-MoE framework featuring jointly token- and layer-wise adaptive routing. It employs a differentiable routing function with a closed-form solution to dynamically determine the number of experts activated per token at each layer. Additionally, it introduces an analytical sparsity control objective that explicitly regularizes expert activation magnitude. Experiments on Qwen3-1.7B and Llama-3.2-3B demonstrate that LD-MoLE achieves state-of-the-art average performance across multiple benchmarks. Crucially, it improves routing flexibility and training stability while remaining robust to hyperparameter choices, removing the need to manually tune routing thresholds or expert counts.

📝 Abstract
Recent studies have shown that combining parameter-efficient fine-tuning (PEFT) with mixture-of-experts (MoE) is an effective strategy for adapting large language models (LLMs) to downstream tasks. However, most existing approaches rely on conventional TopK routing, which requires careful hyperparameter tuning and assigns a fixed number of experts to each token. In this work, we propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts that enables adaptive, token-dependent, and layer-wise expert allocation. Our method replaces the non-differentiable TopK selection with a differentiable routing function that admits a closed-form solution. Moreover, our design allows the model to adaptively determine the number of experts to activate for each token at different layers. In addition, we introduce an analytical sparsity control objective to regularize the number of activated experts. Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores compared to state-of-the-art baselines across a diverse set of benchmarks. Our method not only achieves superior performance but also demonstrates the ability to learn token-dependent and layer-wise expert allocation.
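The abstract states that LD-MoLE replaces non-differentiable TopK selection with a differentiable routing function that has a closed-form solution, but the function itself is not given on this page. As a minimal sketch of what such a router can look like, sparsemax (a standard technique, not necessarily the one used in the paper) has both properties: it is the closed-form Euclidean projection of the router logits onto the probability simplex, and it can assign exactly zero weight to some experts, so the number of active experts varies per token.

```python
import numpy as np

def sparsemax(z):
    # Closed-form projection of router logits z onto the probability simplex.
    # Unlike softmax, the result can contain exact zeros, so each token
    # activates a variable (data-dependent) number of experts.
    z_sorted = np.sort(z)[::-1]           # logits in descending order
    k = np.arange(1, len(z) + 1)          # 1-based support-size candidates
    cssv = np.cumsum(z_sorted)            # cumulative sums of sorted logits
    support = k * z_sorted > cssv - 1.0   # which support sizes are feasible
    k_max = k[support][-1]                # largest feasible support size
    tau = (cssv[k_max - 1] - 1.0) / k_max # threshold shared by active experts
    return np.maximum(z - tau, 0.0)       # zero out experts below threshold

# Per-token routing over 3 hypothetical LoRA experts:
logits = np.array([[2.0, 1.0, -1.0],
                   [1.0, 0.9, -2.0]])
weights = np.stack([sparsemax(row) for row in logits])
active_per_token = (weights > 0).sum(axis=1)  # differs across tokens
```

A token's output would then be the gate-weighted sum of its active LoRA experts' outputs; inactive experts contribute nothing and need not be computed.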
Problem

Research questions and friction points this paper is trying to address.

Replaces fixed TopK routing with learnable dynamic expert allocation
Enables adaptive token-dependent expert selection across model layers
Introduces differentiable routing with analytical sparsity control mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable dynamic routing for LoRA experts
Differentiable routing with closed-form solution
Adaptive layer-wise expert allocation per token
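The analytical sparsity control objective itself is not reproduced on this page. One illustrative stand-in, assuming each token's gate weights are nonnegative and sum to one, penalizes the deviation of a differentiable "effective expert count" (the inverse participation ratio) from a target activation budget. The names `soft_expert_count`, `sparsity_penalty`, and `target_active` are hypothetical, not the paper's API.

```python
import numpy as np

def soft_expert_count(gates, eps=1e-9):
    # Inverse participation ratio: a differentiable proxy for how many
    # experts a token effectively uses. For gates summing to 1, it equals
    # k when mass is spread evenly over k experts, and 1 when a single
    # expert takes all the mass.
    return 1.0 / (np.square(gates).sum(axis=-1) + eps)

def sparsity_penalty(gates, target_active=2.0):
    # Hypothetical analytical regularizer: pull each token's effective
    # expert count toward a target budget, averaged over tokens.
    return np.mean((soft_expert_count(gates) - target_active) ** 2)

# gates: (tokens, experts); token 0 splits over two experts, token 1 uses one.
gates = np.array([[0.5, 0.5, 0.0],
                  [1.0, 0.0, 0.0]])
penalty = sparsity_penalty(gates, target_active=2.0)
```

Because the proxy is smooth in the gate values, such a penalty can be minimized jointly with the task loss by gradient descent, unlike a hard count of nonzero experts.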