🤖 AI Summary
To address the quadratic complexity of Transformers, which makes them impractical for long sequences, and the loss of local detail in Mamba caused by its fixed state dimension, this paper proposes SST, a multi-scale Mixture-of-Experts model. SST introduces a dual-expert architecture that pairs Mamba with a Local Window Transformer (LWT), coupled with an input-adaptive long-range/short-range router that dynamically fuses global-trend and local-fluctuation representations. It further incorporates multi-scale time-series decomposition and SSM-based state modeling to achieve $O(L)$ linear computational complexity with low memory overhead. On both long-horizon and short-horizon time-series forecasting benchmarks, SST achieves state-of-the-art performance while significantly reducing computational cost and GPU memory consumption. The implementation is publicly available.
📝 Abstract
Despite significant progress in time series forecasting, existing forecasters often overlook the heterogeneity between long-range and short-range time series, leading to performance degradation in practical applications. In this work, we highlight the need for distinct objectives tailored to different ranges. We point out that time series can be decomposed into global patterns and local variations, which should be addressed separately in long- and short-range time series. To meet these objectives, we propose a multi-scale hybrid Mamba-Transformer experts model, State Space Transformer (SST). SST leverages Mamba as one expert to extract global patterns from coarse-grained long-range time series, and a Local Window Transformer (LWT) as the other expert to capture local variations in fine-grained short-range time series. With its input-dependent mechanism, the State Space Model (SSM)-based Mamba selectively retains long-term patterns and filters out fluctuations, while LWT employs a local window to enhance locality awareness and thus effectively capture local variations. To adaptively integrate the global patterns and local variations, a long-short router dynamically adjusts the contributions of the two experts. SST achieves superior performance while scaling linearly, $O(L)$, in the time series length $L$. Comprehensive experiments demonstrate that SST achieves SOTA results in long- and short-range time series forecasting while maintaining a low memory footprint and computational cost. The code of SST is available at https://github.com/XiongxiaoXu/SST.
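The long-short router described above can be sketched as a small gating network that produces input-dependent weights for the two experts and fuses their outputs. The sketch below is a minimal illustration, not the paper's implementation: the function names, the linear router, and the use of concatenated expert features as the routing input are all assumptions made for clarity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def long_short_route(x_long, x_short, w_router):
    """Hypothetical long-short router sketch: compute input-dependent
    weights from the two expert outputs and return their weighted fusion.
    `w_router` is an assumed (2, 2*d) matrix mapping concatenated expert
    features to two routing logits."""
    feats = np.concatenate([x_long, x_short])   # routing input (assumption)
    weights = softmax(w_router @ feats)         # [w_global, w_local], sums to 1
    return weights[0] * x_long + weights[1] * x_short

rng = np.random.default_rng(0)
d = 8
x_long = rng.normal(size=d)    # stand-in for the Mamba expert (global patterns)
x_short = rng.normal(size=d)   # stand-in for the LWT expert (local variations)
w_router = rng.normal(size=(2, 2 * d))
fused = long_short_route(x_long, x_short, w_router)
print(fused.shape)  # (8,)
```

Because the softmax weights sum to one, the fused output is a convex combination of the two expert outputs, so the router can smoothly shift emphasis between global patterns and local variations depending on the input.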