Fast Training of Mixture-of-Experts for Time Series Forecasting via Expert Loss Integration

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of traditional Mixture-of-Experts (MoE) models in time series forecasting: insufficient expert specialization and low training efficiency, the latter often requiring repeated full model retraining. The authors propose an adaptive MoE framework that adds expert-specific losses to the base forecasting loss in the overall optimization objective, coupled with a partial online learning mechanism that updates the gating mechanism and expert parameters incrementally. Empirical evaluations on real-world economic, tourism, and energy datasets of varying frequencies show that the method generally outperforms classical statistical baselines as well as state-of-the-art neural architectures such as Transformers and WaveNet in both forecasting accuracy and computational efficiency.
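
A minimal sketch of what a combined objective of this kind might look like for a single scalar target. It assumes squared error for both terms, a softmax gate, and a hypothetical weight `lam` on the expert term; the summary does not state the paper's exact formulation or weighting, so every name here is illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_loss(expert_preds, gate_logits, target, lam=0.5):
    """Return (total, base, expert_losses) for one scalar target.

    base          : squared error of the gate-weighted mixture forecast
    expert_losses : squared error of each expert on its own
    total         : base + lam * gate-weighted sum of expert_losses
                    (the weighting scheme is an assumption)
    """
    gates = softmax(gate_logits)
    mixture = sum(g * p for g, p in zip(gates, expert_preds))
    base = (target - mixture) ** 2
    expert_losses = [(target - p) ** 2 for p in expert_preds]
    expert_term = sum(g * l for g, l in zip(gates, expert_losses))
    return base + lam * expert_term, base, expert_losses


# Two experts that err in opposite directions, equally weighted.
total, base, per_expert = moe_loss([1.0, 3.0], [0.0, 0.0], target=2.0)
# base == 0.0, per_expert == [1.0, 1.0], total == 0.5
```

In this call the mixture forecast is exact, so the base loss is zero, yet each expert is individually off by 1. The expert term stays positive, which is the kind of expert-level error signal a base-only objective would not see.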

📝 Abstract

We propose a novel adaptive Mixture-of-Experts (MoE) framework for time series forecasting that enhances expert specialization by incorporating expert-specific loss information directly into the training process. Notably, the overall objective comprises the base forecasting loss and expert-specific losses, allowing expert-level prediction errors to jointly shape training alongside the global forecasting loss. This framework is further combined with a partial online learning strategy, enabling incremental updates of both the gating mechanism and expert parameters. This approach significantly reduces computational cost by eliminating the need for repeated full model retraining. By integrating expert-level loss awareness with efficient online optimization, the proposed method achieves improved learning efficiency while maintaining strong predictive performance. Empirical results across economic, tourism, and energy datasets with varying frequencies demonstrate that the proposed approach generally outperforms both statistical methods and state-of-the-art neural network models, such as Transformers and WaveNet, in forecasting accuracy and computational efficiency. Furthermore, ablation studies confirm the effectiveness of the expert-specific loss integration strategy, highlighting its contribution to enhancing predictive performance.
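
The incremental update of gate and experts described above can be sketched as one gradient step on the newest observation, rather than a refit on the full history. This is an illustration under stated assumptions, not the paper's algorithm: the experts are hypothetical one-parameter linear models (`w * x`), the gate is a softmax over free logits, the loss is squared error with an assumed weight `lam` on the expert term, and what "partial" covers in the paper is not specified here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def total_loss(weights, gate_logits, x, y, lam=0.5):
    """Base mixture loss plus lam * gate-weighted expert losses (assumed form)."""
    gates = softmax(gate_logits)
    preds = [w * x for w in weights]
    mixture = sum(g * p for g, p in zip(gates, preds))
    expert_term = sum(g * (y - p) ** 2 for g, p in zip(gates, preds))
    return (y - mixture) ** 2 + lam * expert_term


def online_step(weights, gate_logits, x, y, lr=0.1, lam=0.5):
    """One SGD step on a single new observation (x, y).

    Updates both the expert weights and the gate logits; returns new copies.
    """
    gates = softmax(gate_logits)
    preds = [w * x for w in weights]
    mixture = sum(g * p for g, p in zip(gates, preds))
    resid = y - mixture

    # Gradient w.r.t. each expert weight: base term plus expert-specific term.
    grad_w = [(-2.0 * resid * g - 2.0 * lam * g * (y - p)) * x
              for g, p in zip(gates, preds)]

    # Gradient w.r.t. each gate probability, then the softmax chain rule.
    a = [-2.0 * resid * p + lam * (y - p) ** 2 for p in preds]
    a_bar = sum(g * ak for g, ak in zip(gates, a))
    grad_z = [g * (ak - a_bar) for g, ak in zip(gates, a)]

    new_w = [w - lr * gw for w, gw in zip(weights, grad_w)]
    new_z = [z - lr * gz for z, gz in zip(gate_logits, grad_z)]
    return new_w, new_z


# One new observation arrives; expert 1 is currently the closer of the two.
w, z = [0.5, 1.5], [0.0, 0.0]
before = total_loss(w, z, x=1.0, y=2.0)
w, z = online_step(w, z, x=1.0, y=2.0)
after = total_loss(w, z, x=1.0, y=2.0)
```

After the step, `after` is smaller than `before` and the gate logit of the more accurate expert has increased, so the cost per new observation is a single gradient evaluation instead of a full retraining pass.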

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

Time Series Forecasting

Expert Specialization

Training Efficiency

Online Learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

Expert-Specific Loss

Online Learning