🤖 AI Summary
Lightweight MLPs suffer from insufficient accuracy in long-term time series forecasting, while Transformer- and CNN-based teacher models incur prohibitive computational and memory overhead. Method: This paper proposes TimeDistill, a cross-architecture knowledge distillation framework. It first reveals the complementary modeling capabilities of heterogeneous architectures for multi-scale and multi-period patterns across the time and frequency domains; it then introduces a time-frequency feature decoupling distillation strategy to transfer these patterns to the student. It theoretically formalizes knowledge distillation as a special case of mixup data augmentation to guide the distillation process, and further optimizes the student MLP architecture to enhance its representational capacity. Contribution/Results: Evaluated on eight benchmark datasets, the distilled lightweight MLP improves accuracy by up to 18.6% over its standalone counterpart and even surpasses its teacher models, while achieving up to 7× faster inference with 130× fewer parameters.
📄 Abstract
Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating lightweight MLPs with advanced architectures using knowledge distillation (KD). Our preliminary study reveals that different models can capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to MLPs. Additionally, we provide a theoretical analysis demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing teacher models on eight datasets. It also achieves up to 7× faster inference and requires 130× fewer parameters. Furthermore, we conduct extensive evaluations to highlight the versatility and effectiveness of TimeDistill.
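To make the two core ideas concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation) of what a time-and-frequency distillation objective and the mixup view of KD could look like. The function names, loss weights `alpha`/`beta`, and mixing coefficient `lam` are all illustrative assumptions; the frequency-domain term matches magnitude spectra via a naive DFT, standing in for whatever spectral matching the paper actually uses.

```python
import cmath

def dft_mag(x):
    """Magnitude spectrum of a real sequence via a naive DFT (stdlib only).
    A stand-in for an FFT; it exposes the multi-period (frequency) content."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def timefreq_distill_loss(student, teacher, target, alpha=0.5, beta=0.5):
    """Hypothetical distillation objective: the student MLP is pulled toward
    the ground truth (supervised term) and toward the teacher in both the
    time domain (multi-scale patterns) and the frequency domain
    (multi-period patterns)."""
    sup = mse(student, target)                          # supervised forecasting loss
    time_kd = mse(student, teacher)                     # time-domain distillation
    freq_kd = mse(dft_mag(student), dft_mag(teacher))   # frequency-domain distillation
    return sup + alpha * time_kd + beta * freq_kd

def mixup_target(target, teacher, lam=0.5):
    """The mixup interpretation of KD: regressing toward the teacher is
    equivalent to training on a convex combination of the ground-truth
    future and the teacher's prediction."""
    return [lam * y + (1 - lam) * t for y, t in zip(target, teacher)]
```

For example, when the student already matches both the teacher and the ground truth, every term vanishes; as `lam` moves from 1 to 0, `mixup_target` interpolates from pure supervised learning to pure teacher imitation.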