MoHETS: Long-term Time Series Forecasting with Mixture-of-Heterogeneous-Experts

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of modeling multi-scale structures—such as global trends, local periodicities, and non-stationarities—in long-term forecasting of real-world multivariate time series. To this end, we propose MoHETS, a Transformer-based encoder architecture that incorporates a sparse Mixture-of-Heterogeneous-Experts (MoHE) mechanism to dynamically route temporal segments to specialized experts: deep convolutional and Fourier basis modules, which respectively capture continuity and periodic patterns. Exogenous information is integrated via covariate cross-attention to enhance robustness against non-stationary dynamics. A lightweight convolutional patch decoder enables flexible prediction horizons of arbitrary length while maintaining parameter efficiency and strong generalization. Evaluated on seven standard multivariate time series benchmarks, MoHETS achieves state-of-the-art performance, reducing average MSE by 12% compared to recent strong baselines and significantly improving long-term forecasting accuracy.

📝 Abstract
Real-world multivariate time series can exhibit intricate multi-scale structures, including global trends, local periodicities, and non-stationary regimes, which makes long-horizon forecasting challenging. Although sparse Mixture-of-Experts (MoE) approaches improve scalability and specialization, they typically rely on homogeneous MLP experts that poorly capture the diverse temporal dynamics of time series data. We address these limitations with MoHETS, an encoder-only Transformer that integrates sparse Mixture-of-Heterogeneous-Experts (MoHE) layers. MoHE routes temporal patches to a small subset of expert networks, combining a shared depthwise-convolution expert for sequence-level continuity with routed Fourier-based experts for patch-level periodic structures. MoHETS further improves robustness to non-stationary dynamics by incorporating exogenous information via cross-attention over covariate patch embeddings. Finally, we replace parameter-heavy linear projection heads with a lightweight convolutional patch decoder, improving parameter efficiency, reducing training instability, and allowing a single model to generalize across arbitrary forecast horizons. We validate MoHETS across seven multivariate benchmarks and multiple horizons; it consistently achieves state-of-the-art performance, reducing average MSE by $12\%$ relative to strong recent baselines and demonstrating effective heterogeneous specialization for long-term forecasting.
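The core mechanism described above can be illustrated with a minimal sketch. This is not the authors' code; the router, expert shapes, and hyperparameters (`top_k`, the `keeps` frequency cutoffs, the fixed smoothing kernel) are illustrative assumptions. It shows the general idea: a shared depthwise-convolution expert is applied to every temporal patch, while a sparse gate routes each patch to a small subset of Fourier-based experts that retain different frequency bands.

```python
# Illustrative sketch (not the paper's implementation) of a sparse
# Mixture-of-Heterogeneous-Experts layer: a shared convolutional expert
# plus top-k routed Fourier experts over temporal patches.
import numpy as np

rng = np.random.default_rng(0)

def depthwise_conv_expert(x, kernel):
    # x: (num_patches, patch_len); same-length 1D convolution per patch,
    # standing in for the shared expert that models sequence continuity.
    return np.stack([np.convolve(p, kernel, mode="same") for p in x])

def fourier_expert(x, keep):
    # Keep only the `keep` lowest-frequency rFFT components of each patch,
    # standing in for an expert specialized in periodic structure.
    f = np.fft.rfft(x, axis=-1)
    f[:, keep:] = 0
    return np.fft.irfft(f, n=x.shape[-1], axis=-1)

def mohe_layer(x, router_w, kernel, keeps, top_k=1):
    # router_w: (patch_len, num_experts) linear gate; softmax over experts.
    logits = x @ router_w
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = depthwise_conv_expert(x, kernel)      # shared expert: always active
    for i, patch in enumerate(x):
        top = np.argsort(probs[i])[-top_k:]     # sparse routing: top-k experts
        for e in top:
            out[i] += probs[i, e] * fourier_expert(patch[None], keeps[e])[0]
    return out

patches = rng.standard_normal((4, 16))          # 4 patches of length 16
router = rng.standard_normal((16, 3))           # gate over 3 Fourier experts
y = mohe_layer(patches, router,
               kernel=np.array([0.25, 0.5, 0.25]),
               keeps=[2, 4, 8])                 # per-expert frequency cutoffs
print(y.shape)  # (4, 16)
```

The sketch keeps the two defining properties of the MoHE layer: the experts are structurally heterogeneous (convolutional vs. spectral), and only a sparse subset of routed experts fires per patch, so compute grows with `top_k` rather than with the total number of experts.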
Problem

Research questions and friction points this paper is trying to address.

long-term time series forecasting
Mixture-of-Experts
heterogeneous temporal dynamics
non-stationary time series
multi-scale structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Heterogeneous-Experts
Long-term Time Series Forecasting
Fourier-based Experts
Depthwise Convolution
Covariate-aware Cross-Attention
Evandro S. Ortigossa
Postdoctoral researcher, Weizmann Institute of Science
Data Science · Machine Learning · Deep Learning · Explainable Artificial Intelligence (XAI)
Guy Lutsker
Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel; Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel
Eran Segal
Professor of Computer Science, Weizmann Institute of Science
Computational biology