🤖 AI Summary
This work addresses the scalability bottleneck in high-dimensional multivariate time series forecasting, where conventional Transformers suffer from quadratic computational complexity due to cross-channel attention. To overcome this limitation, the authors propose MICA, a novel architecture that introduces compressive attention along the channel dimension for the first time. Built upon a channel-independent Transformer backbone, MICA incorporates a linear-complexity cross-channel attention module that explicitly captures dynamic inter-channel dependencies. Empirical evaluations show that MICA reduces prediction error over its channel-independent counterparts by 5.4% on average across multiple benchmark datasets, and by up to 25.4% on individual datasets, with MICA-equipped models ranking first among state-of-the-art deep multivariate Transformer and MLP-based baselines. Moreover, MICA exhibits superior scaling efficiency as both the number of channels and the context length increase.
📝 Abstract
Multivariate forecasting with Transformers faces a core scalability challenge: modeling cross-channel dependencies via attention compounds attention's quadratic sequence complexity with quadratic channel scaling, making full cross-channel attention impractical for high-dimensional time series. We propose Multivariate Infini Compressive Attention (MICA), an architectural design to extend channel-independent Transformers to channel-dependent forecasting. By adapting efficient attention techniques from the sequence dimension to the channel dimension, MICA adds a cross-channel attention mechanism to channel-independent backbones that scales linearly with channel count and context length. We evaluate channel-independent Transformer architectures with and without MICA across multiple forecasting benchmarks. MICA reduces forecast error over its channel-independent counterparts by 5.4% on average and up to 25.4% on individual datasets, highlighting the importance of explicit cross-channel modeling. Moreover, models with MICA rank first among deep multivariate Transformer and MLP baselines. MICA models also scale more efficiently with respect to both channel count and context length than Transformer baselines that compute attention across both the temporal and channel dimensions, establishing compressive attention as a practical solution for scalable multivariate forecasting.
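The key mechanism described above, attention over the channel axis whose cost grows linearly rather than quadratically with channel count, can be illustrated with a minimal kernel-feature-map sketch. This is not the paper's actual implementation; the feature map (ELU + 1), the projection shapes, and all dimension names are illustrative assumptions, chosen only to show why compressing the key-value pairs into a fixed-size summary removes the O(C²) term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): C channels, d-dim channel embeddings.
C, d = 64, 16
Q = rng.standard_normal((C, d))  # per-channel queries
K = rng.standard_normal((C, d))  # per-channel keys
V = rng.standard_normal((C, d))  # per-channel values

def phi(x):
    # ELU + 1 feature map, a common positivity trick in linear attention;
    # assumed here, not taken from MICA.
    return np.where(x > 0, x + 1.0, np.exp(x))

def cross_channel_linear_attention(Q, K, V):
    """Attention over the channel axis in O(C * d^2) time.

    All C key-value pairs are compressed into a d x d summary S and a
    d-dim normalizer z, so no C x C attention matrix is ever formed.
    """
    qf, kf = phi(Q), phi(K)
    S = kf.T @ V           # (d, d) compressed key-value memory
    z = kf.sum(axis=0)     # (d,)  normalizer over all channels
    return (qf @ S) / (qf @ z)[:, None]

out = cross_channel_linear_attention(Q, K, V)  # shape (C, d)
```

For the same feature map, this factored form is exactly equal to the naive O(C²) computation `(phi(Q) @ phi(K).T) @ V` with row normalization; the reordering of the matrix products is what turns quadratic channel scaling into linear scaling.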