🤖 AI Summary
This work addresses the challenge of modeling multi-scale structure (global trends, local periodicities, and non-stationary regimes) in long-term forecasting of real-world multivariate time series. To this end, we propose MoHETS, a Transformer-based encoder architecture with a sparse Mixture-of-Heterogeneous-Experts (MoHE) mechanism that dynamically routes temporal patches to specialized experts: a shared depthwise-convolution expert that captures sequence-level continuity and routed Fourier-basis experts that capture periodic patterns. Exogenous information is integrated via cross-attention over covariate embeddings to improve robustness to non-stationary dynamics. A lightweight convolutional patch decoder supports prediction horizons of arbitrary length while maintaining parameter efficiency and strong generalization. Evaluated on seven standard multivariate time series benchmarks, MoHETS achieves state-of-the-art performance, reducing average MSE by 12% relative to recent strong baselines and markedly improving long-term forecasting accuracy.
📝 Abstract
Real-world multivariate time series can exhibit intricate multi-scale structure, including global trends, local periodicities, and non-stationary regimes, which makes long-horizon forecasting challenging. Although sparse Mixture-of-Experts (MoE) approaches improve scalability and specialization, they typically rely on homogeneous MLP experts that poorly capture the diverse temporal dynamics of time series data. We address these limitations with MoHETS, an encoder-only Transformer that integrates sparse Mixture-of-Heterogeneous-Experts (MoHE) layers. MoHE routes temporal patches to a small subset of expert networks, combining a shared depthwise-convolution expert for sequence-level continuity with routed Fourier-based experts for patch-level periodic structure. MoHETS further improves robustness to non-stationary dynamics by incorporating exogenous information via cross-attention over covariate patch embeddings. Finally, we replace parameter-heavy linear projection heads with a lightweight convolutional patch decoder, which improves parameter efficiency, reduces training instability, and allows a single model to generalize across arbitrary forecast horizons. We validate MoHETS on seven multivariate benchmarks across multiple horizons; it consistently achieves state-of-the-art performance, reducing average MSE by $12\%$ relative to strong recent baselines and demonstrating effective heterogeneous specialization for long-term forecasting.
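The routing idea behind a MoHE layer can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the gating matrix `gate_w`, the choice of low-pass Fourier projections as the routed experts, and the fixed smoothing kernel standing in for the shared depthwise-convolution expert are all illustrative simplifications.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def fourier_expert(patch, keep):
    # Routed expert (assumed form): keep only the `keep` lowest Fourier
    # modes of the patch, modeling patch-level periodic structure.
    spec = np.fft.rfft(patch)
    spec[keep:] = 0.0
    return np.fft.irfft(spec, n=len(patch))


def shared_conv_expert(seq, kernel):
    # Shared expert (assumed form): a fixed depthwise smoothing kernel
    # applied across the whole sequence, modeling continuity.
    pad = len(kernel) // 2
    padded = np.pad(seq, pad, mode="edge")
    return np.convolve(padded, kernel, mode="valid")[: len(seq)]


def mohe_layer(patches, gate_w, top_k=2):
    """Sparse heterogeneous mixture over temporal patches.

    patches: (num_patches, patch_len) array of patch embeddings.
    gate_w:  (patch_len, n_experts) illustrative gating matrix.
    Each patch is routed to its top_k Fourier experts (experts differ in
    how many modes they keep); the shared conv expert is always added.
    """
    num_patches, patch_len = patches.shape
    probs = softmax(patches @ gate_w)          # (num_patches, n_experts)
    out = np.zeros_like(patches)
    for i in range(num_patches):
        top = np.argsort(probs[i])[-top_k:]    # indices of selected experts
        w = probs[i, top] / probs[i, top].sum()  # renormalize over top-k
        for e, we in zip(top, w):
            out[i] += we * fourier_expert(patches[i], keep=e + 1)
    seq = patches.reshape(-1)                  # shared expert sees full sequence
    shared = shared_conv_expert(seq, np.array([0.25, 0.5, 0.25]))
    return out + shared.reshape(num_patches, patch_len)


# demo: route 6 patches of length 8 among 4 routed experts
rng = np.random.default_rng(0)
patches = rng.standard_normal((6, 8))
gate_w = rng.standard_normal((8, 4))
out = mohe_layer(patches, gate_w, top_k=2)
print(out.shape)  # (6, 8)
```

Because only `top_k` of the routed experts fire per patch, compute stays sparse while the shared convolution preserves sequence-level continuity, which is the division of labor the MoHE design aims for.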