ExFusion: Efficient Transformer Training via Multi-Experts Fusion

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high training and deployment costs of Mixture-of-Experts (MoE) models despite their strong performance. To mitigate these overheads, the authors propose ExFusion, a novel approach that initializes the feed-forward network (FFN) in Transformers as a multi-expert structure and dynamically fuses these experts into a single equivalent expert during training via learnable weights, enabling efficient forward computation. ExFusion is the first method to incorporate expert fusion directly within the training process, preserving the performance benefits of MoE while eliminating additional inference latency and memory consumption. Experimental results demonstrate that ExFusion consistently enhances model performance across diverse vision and language tasks with negligible computational overhead.
📝 Abstract
Mixture-of-Experts (MoE) models substantially improve performance by increasing the capacity of dense architectures. However, directly training MoE models requires considerable computational resources and introduces extra overhead in parameter storage and deployment. It is therefore critical to develop an approach that leverages the multi-expert capability of MoE to enhance performance while incurring minimal additional cost. To this end, we propose a novel pre-training approach, termed ExFusion, which improves the efficiency of Transformer training through multi-expert fusion. Specifically, during the initialization phase, ExFusion upcycles the feed-forward network (FFN) of the Transformer into a multi-expert configuration, where each expert is assigned a weight for later parameter fusion. During training, these weights allow multiple experts to be fused into a single unified expert equivalent to the original FFN, which is subsequently used for forward computation. As a result, ExFusion introduces multi-expert characteristics into the training process while incurring only marginal computational cost compared to standard dense training. After training, the learned weights are used to integrate the multiple experts into a single unified expert, thereby eliminating additional overhead in storage and deployment. Extensive experiments on a variety of computer vision and natural language processing tasks demonstrate the effectiveness of the proposed method.
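The fusion idea described in the abstract can be illustrated with a minimal sketch (not the authors' implementation; the expert count, the shared-shape assumption, and the names `experts`/`alphas` are illustrative). Because a linear projection is linear in its weights, fusing the expert weight matrices with learnable scalars before the forward pass gives the same result as weighting the expert outputs, while keeping the forward cost of a single dense FFN:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 16, 4

# Hypothetical setup: each "expert" has the same shape as the original
# dense FFN's first projection; alphas are the learnable fusion weights
# (here fixed to a uniform initialization for the sketch).
experts = [rng.standard_normal((d_model, d_ff)) for _ in range(n_experts)]
alphas = np.full(n_experts, 1.0 / n_experts)

# Fuse the experts into one equivalent weight matrix before the forward
# pass, so the forward computation matches a standard dense FFN.
W_fused = sum(a * W for a, W in zip(alphas, experts))

x = rng.standard_normal((2, d_model))
h = np.maximum(x @ W_fused, 0.0)  # first FFN projection + ReLU

# Sanity check: fusing weights first equals weighting the experts'
# (pre-activation) outputs, since the projection is linear in the weights.
h_ref = np.maximum(sum(a * (x @ W) for a, W in zip(alphas, experts)), 0.0)
assert np.allclose(h, h_ref)
```

After training, only `W_fused` needs to be stored and deployed, which is why the approach adds no inference-time latency or memory over the dense baseline.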
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
efficient training
parameter overhead
Transformer models
computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Transformer
Multi-expert Fusion
Efficient Training
Parameter Integration