FURINA: Free from Unmergeable Router via LINear Aggregation of mixed experts

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MoE-LoRA methods rely on discrete routers, preventing full integration of expert modules with the backbone model and incurring non-negligible inference overhead. To address this, we propose FURINA—a router-free Mixture-of-Experts Low-Rank Adaptation framework. FURINA eliminates explicit routing by introducing direction-magnitude-decoupled LoRA adapters, an angle-similarity-based self-routing mechanism, shared magnitude-vector scaling, and a sparsity-driven expert selection loss. This enables dynamic expert activation and yields an end-to-end mergeable architecture. Empirically, FURINA significantly outperforms standard LoRA across multiple tasks, matches or exceeds state-of-the-art MoE-LoRA methods in performance, eliminates routing computation overhead entirely, supports zero-cost model merging, and—crucially—achieves the first seamless, unified deployment of MoE-LoRA with the backbone model.

📝 Abstract
The Mixture of Experts (MoE) paradigm has been successfully integrated into Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning (PEFT), delivering performance gains with minimal parameter overhead. However, a key limitation of existing MoE-LoRA methods is their reliance on a discrete router, which prevents the integration of the MoE components into the backbone model. To overcome this, we propose FURINA, a novel Free from Unmergeable Router framework based on the LINear Aggregation of experts. FURINA eliminates the router by introducing a Self-Routing mechanism. This is achieved through three core innovations: (1) decoupled learning of the direction and magnitude for LoRA adapters, (2) a shared learnable magnitude vector for consistent activation scaling, and (3) an expert selection loss that encourages divergent expert activation. The proposed mechanism leverages the angular similarity between the input and each adapter's directional component to activate experts, which are then scaled by the shared magnitude vector. This design allows the output norm to naturally reflect the importance of each expert, thereby enabling dynamic, router-free routing. The expert selection loss further sharpens this behavior by encouraging sparsity and aligning it with standard MoE activation patterns. We also introduce a shared expert within the MoE-LoRA block that provides stable, foundational knowledge. To the best of our knowledge, FURINA is the first router-free, MoE-enhanced LoRA method that can be fully merged into the backbone model, introducing zero additional inference-time cost or complexity. Extensive experiments demonstrate that FURINA not only significantly outperforms standard LoRA but also matches or surpasses the performance of existing MoE-LoRA methods, while eliminating the extra inference-time overhead of MoE.
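To make the self-routing idea concrete, here is a minimal numpy sketch of a router-free expert aggregation in the spirit described above: each expert contributes in proportion to the angular similarity between the input and that expert's output direction, scaled by a shared magnitude vector. All names (`furina_sketch`, `A`, `B`, `m`) and the exact gating formula are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 16, 4, 3

# Each expert is a LoRA pair (A_i: down-projection, B_i: up-projection).
# Directions are normalized so output scale comes only from the shared
# magnitude vector m (DoRA-style direction/magnitude decoupling).
A = [rng.standard_normal((r, d)) for _ in range(n_experts)]
B = [rng.standard_normal((d, r)) for _ in range(n_experts)]
m = np.abs(rng.standard_normal(d))  # shared learnable magnitude vector

def furina_sketch(x):
    """Illustrative router-free MoE-LoRA forward pass (hypothetical)."""
    out = np.zeros(d)
    x_hat = x / (np.linalg.norm(x) + 1e-8)
    for Ai, Bi in zip(A, B):
        delta = Bi @ (Ai @ x)                                # expert output
        direction = delta / (np.linalg.norm(delta) + 1e-8)   # unit direction
        # Angular similarity acts as an implicit gate: experts whose
        # directional component aligns with the input contribute more,
        # so no discrete router is needed.
        gate = direction @ x_hat
        out += gate * (m * direction)  # scale by shared magnitude vector
    return out

x = rng.standard_normal(d)
y = furina_sketch(x)
```

Because the gating is a linear function of normalized dot products rather than a separate router network, the aggregated update can in principle be folded back into the backbone weights, which is the merging property the paper claims; the sketch above only illustrates the activation mechanics.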
Problem

Research questions and friction points this paper is trying to address.

Eliminates discrete router in MoE-LoRA methods
Enables full integration into backbone model
Removes inference-time overhead while maintaining performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear aggregation replaces discrete router
Self-routing via angular similarity activation
Shared magnitude vector enables consistent scaling
Authors
Jiayi Han (Inspur Genersoft, Inspur Group)
Liang Du (Associate Professor, Villanova University; electric power systems)
Yinda Chen (University of Science and Technology of China, Xiamen University; Machine Learning Theory, Self-supervised Learning, Image Compression)
Xiao Kang (Shandong University)
Weiyang Ding (Fudan University)
Donghong Han (Northeastern University, China)