🤖 AI Summary
Traditional Mixture-of-Experts (MoE) models for univariate time series forecasting suffer from complex training procedures, reliance on auxiliary load-balancing losses, and manual router tuning. To address these issues, this paper proposes an attention-inspired gated-routing MoE framework. Our approach eliminates the conventional softmax-based router and explicit load-balancing objectives, instead balancing expert utilization automatically through sparse expert selection within a Transformer-style architecture, which significantly simplifies training while preserving parameter and computational efficiency. Evaluated across diverse intermittent, short-horizon, and long-horizon datasets from the energy, hydrology, and retail domains, our method achieves higher prediction accuracy than state-of-the-art models such as PatchTST while using fewer parameters, and incurs lower inference overhead than LSTM, demonstrating a superior accuracy-efficiency trade-off.
📝 Abstract
Accurate univariate forecasting remains a pressing need in real-world systems such as energy markets, hydrology, retail demand, and IoT monitoring, where signals are often intermittent and horizons span both short and long terms. While Transformers and Mixture-of-Experts (MoE) architectures are increasingly favored for time series forecasting, a key gap persists: MoE models typically require complicated training that combines the main forecasting loss with auxiliary load-balancing losses, along with careful tuning of routing and temperature hyperparameters, which hinders practical adoption. In this paper, we propose a model architecture that simplifies the training process for univariate time series forecasting and effectively addresses both long- and short-term horizons, including intermittent patterns. Our approach combines sparse MoE computation with a novel attention-inspired gating mechanism that replaces the traditional one-layer softmax router. Through extensive empirical evaluation, we demonstrate that our gating design naturally promotes balanced expert utilization and achieves superior predictive accuracy without requiring the auxiliary load-balancing losses typically used in classical MoE implementations. The model achieves better performance while using only a fraction of the parameters required by state-of-the-art Transformer models such as PatchTST. Furthermore, experiments across diverse datasets confirm that our MoE architecture with the proposed gating mechanism is more computationally efficient than LSTM for both long- and short-term forecasting, enabling cost-effective inference. These results highlight the potential of our approach for practical time series forecasting applications where both accuracy and computational efficiency are critical.
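The abstract does not spell out the gating mechanism, so the following is a minimal, hypothetical sketch of one way an attention-inspired gated MoE layer could be built: each expert owns a learnable key vector, the input acts as a query, scaled dot-product scores select the top-k experts, and no auxiliary load-balancing loss is added. All names and hyperparameters here (d_model, n_experts, top_k, the normalization over selected experts) are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class AttentionGatedMoE(nn.Module):
    """Hypothetical sketch of an attention-inspired gated MoE layer.

    Instead of a one-layer softmax router over all experts, each expert has a
    learnable "key"; the input token acts as a query, and scaled dot-product
    scores pick the top-k experts. This is an assumption-based illustration,
    not the paper's verified architecture.
    """

    def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.expert_keys = nn.Parameter(torch.randn(n_experts, d_model))
        self.query_proj = nn.Linear(d_model, d_model)
        self.top_k = top_k
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q = self.query_proj(x)  # queries derived from the input tokens
        # Attention-style scores between tokens and expert keys: (batch, seq_len, n_experts)
        scores = torch.einsum("bsd,ed->bse", q, self.expert_keys) * self.scale
        # Sparse expert selection: keep only the top-k scores per token
        weights, idx = scores.topk(self.top_k, dim=-1)
        # Normalize only over the selected experts (assumed normalization choice)
        weights = weights.softmax(dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Restricting normalization to the selected experts keeps the computation sparse and removes the dense softmax router that classical MoE layers pair with auxiliary balancing losses; whether the paper uses this particular scoring and normalization is an assumption of this sketch.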