🤖 AI Summary
In multivariate time series forecasting, variate tokenization embeds each variate as a separate token, so self-attention scales quadratically with the number of variates, inflating GPU memory consumption and training time. To address this, we propose VarDrop, a variate-level sparse attention strategy. Its core component, *k-dominant frequency hashing (k-DFH)*, hashes each variate by its ranked dominant frequencies, so variates with similar periodic behavior fall into the same group; stratified sampling then keeps only representative tokens from each group for the dot-product attention. The method preserves forecasting accuracy while significantly reducing GPU memory footprint and training time. Extensive experiments on multiple public benchmarks demonstrate consistent gains over state-of-the-art efficient forecasting models. This work establishes a scalable, frequency-aware approach to modeling periodic multivariate time series.
📝 Abstract
Variate tokenization, which independently embeds each variate as a separate token, has achieved remarkable improvements in multivariate time series forecasting. However, employing self-attention over variate tokens incurs a quadratic computational cost with respect to the number of variates, limiting training efficiency in large-scale applications. To address this issue, we propose VarDrop, a simple yet efficient strategy that reduces token usage by omitting redundant variate tokens during training. VarDrop adaptively excludes redundant tokens within a given batch, thereby reducing the number of tokens used for dot-product attention while preserving essential information. Specifically, we introduce k-dominant frequency hashing (k-DFH), which uses a variate's ranked dominant frequencies in the frequency domain as a hash value to efficiently group variate tokens exhibiting similar periodic behaviors. Then, only representative tokens from each group are retained via stratified sampling. By performing sparse attention with these selected tokens, the computational cost of scaled dot-product attention is significantly alleviated. Experiments on public benchmark datasets demonstrate that VarDrop outperforms existing efficient baselines.
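The grouping step described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names `k_dfh` and `stratified_sample` are ours, variates are assumed to be rows of a real-valued array, and the hash is simply the tuple of the k largest-amplitude frequency indices in rank order.

```python
import numpy as np

def k_dfh(x, k=3):
    """k-dominant frequency hashing (sketch).

    x: (num_variates, seq_len) real array. Returns one hash per variate:
    the indices of its k largest-amplitude frequencies, ordered by rank.
    Variates sharing a hash exhibit similar periodic behavior.
    """
    amp = np.abs(np.fft.rfft(x, axis=-1))[:, 1:]          # drop DC component
    topk = np.argsort(amp, axis=-1)[:, ::-1][:, :k] + 1   # rank-ordered freq indices
    return [tuple(row) for row in topk]

def stratified_sample(hashes, rng=None):
    """Keep one representative variate index per hash group (stratum)."""
    rng = rng or np.random.default_rng(0)
    groups = {}
    for i, h in enumerate(hashes):
        groups.setdefault(h, []).append(i)
    return sorted(int(rng.choice(idx)) for idx in groups.values())
```

Attention would then be computed only over the tokens of the sampled variates, shrinking the quadratic term from the total variate count to the (typically much smaller) number of distinct hash groups.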