TimeDistill: Efficient Long-Term Time Series Forecasting with MLP via Cross-Architecture Distillation

📅 2025-02-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Lightweight MLPs lack sufficient accuracy for long-term time series forecasting, while Transformer- and CNN-based models achieve strong accuracy at prohibitive computational and memory cost. Method: This paper proposes TimeDistill, a cross-architecture knowledge distillation framework. It first shows that heterogeneous architectures capture complementary patterns, in particular multi-scale patterns in the time domain and multi-period patterns in the frequency domain, and then introduces a distillation strategy that transfers these time- and frequency-domain patterns from the teacher to a student MLP. A theoretical analysis interprets this distillation as a special form of mixup data augmentation. Contribution/Results: Evaluated on eight benchmark datasets, the distilled lightweight MLP improves accuracy by up to 18.6% over its standalone counterpart and even surpasses its teacher models, while attaining up to 7× faster inference and 130× fewer parameters.
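The summary above describes a student MLP trained to match its teacher in both the time and frequency domains on top of the usual supervised loss. A minimal illustrative sketch of such an objective is below; the `alpha` weighting, the equal weighting of the two matching terms, and the use of FFT amplitude spectra are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def distill_loss(student_pred, teacher_pred, target, alpha=0.5):
    """Sketch of a cross-architecture distillation objective:
    a supervised term on the ground truth plus time- and
    frequency-domain matching terms against the teacher.
    Shapes: (batch, horizon)."""
    # Supervised term: student forecast vs. ground-truth series.
    sup = np.mean((student_pred - target) ** 2)
    # Time-domain matching: student mimics the teacher's forecast,
    # transferring its multi-scale temporal patterns.
    time_kd = np.mean((student_pred - teacher_pred) ** 2)
    # Frequency-domain matching: compare amplitude spectra so the
    # student inherits the teacher's multi-period structure.
    s_amp = np.abs(np.fft.rfft(student_pred, axis=-1))
    t_amp = np.abs(np.fft.rfft(teacher_pred, axis=-1))
    freq_kd = np.mean((s_amp - t_amp) ** 2)
    return alpha * sup + (1 - alpha) * (time_kd + freq_kd)
```

Because the teacher's forecast acts as a soft target blended with the ground truth via `alpha`, the objective resembles training on a mixup of the two, which is the intuition behind the paper's mixup interpretation of distillation.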

๐Ÿ“ Abstract
Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating lightweight MLP with advanced architectures using knowledge distillation (KD). Our preliminary study reveals different models can capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to MLP. Additionally, we provide a theoretical analysis, demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing teacher models on eight datasets. It also achieves up to 7X faster inference and requires 130X fewer parameters. Furthermore, we conduct extensive evaluations to highlight the versatility and effectiveness of TimeDistill.
Problem

Research questions and friction points this paper is trying to address.

Efficient long-term time series forecasting
Reducing computational and storage requirements
Enhancing MLP performance via knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLP via knowledge distillation
Cross-architecture pattern transfer
Efficient long-term forecasting
Juntong Ni
Emory University
Machine Learning · Time Series
Zewen Liu
Emory University
Machine Learning · Graph Neural Networks · Epidemic Modeling
Shiyu Wang
Department of Computer Science, Emory University, Atlanta, United States
Ming Jin
School of Information and Communication Technology, Griffith University, Nathan, Australia
Wei Jin
Department of Computer Science, Emory University, Atlanta, United States