🤖 AI Summary
Lightweight MLPs suffer from insufficient accuracy in long-term time series forecasting, while Transformer- and CNN-based teacher models incur prohibitive computational and memory overhead. Method: This paper proposes TimeDistill, a cross-architecture knowledge distillation framework. It first reveals the complementary modeling capabilities of heterogeneous architectures for multi-scale and multi-period patterns across the time and frequency domains; it then introduces a time-frequency feature decoupling distillation strategy to transfer these patterns to the student. It theoretically formalizes knowledge distillation as a special case of mixup data augmentation to guide the distillation process, and further optimizes the student MLP architecture to enhance its representational capacity. Contribution/Results: Evaluated on eight benchmark datasets, the distilled lightweight MLP improves accuracy by up to 18.6% over its standalone counterpart and even surpasses its teacher models, while achieving up to 7× faster inference with 130× fewer parameters.
📄 Abstract
Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating lightweight MLPs with advanced architectures using knowledge distillation (KD). Our preliminary study reveals that different models can capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to MLPs. Additionally, we provide a theoretical analysis demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing teacher models on eight datasets. It also achieves up to 7× faster inference and requires 130× fewer parameters. Furthermore, we conduct extensive evaluations to highlight the versatility and effectiveness of TimeDistill.
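To make the two core ideas concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation) of what a time-and-frequency distillation objective and the mixup view of KD could look like. The function names, loss weights `alpha`/`beta`, and mixing coefficient `lam` are all illustrative assumptions; the frequency-domain term matches magnitude spectra via a naive DFT, standing in for whatever spectral matching the paper actually uses.

```python
import cmath

def dft_mag(x):
    """Magnitude spectrum of a real sequence via a naive DFT (stdlib only).
    A stand-in for an FFT; it exposes the multi-period (frequency) content."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def timefreq_distill_loss(student, teacher, target, alpha=0.5, beta=0.5):
    """Hypothetical distillation objective: the student MLP is pulled toward
    the ground truth (supervised term) and toward the teacher in both the
    time domain (multi-scale patterns) and the frequency domain
    (multi-period patterns)."""
    sup = mse(student, target)                          # supervised forecasting loss
    time_kd = mse(student, teacher)                     # time-domain distillation
    freq_kd = mse(dft_mag(student), dft_mag(teacher))   # frequency-domain distillation
    return sup + alpha * time_kd + beta * freq_kd

def mixup_target(target, teacher, lam=0.5):
    """The mixup interpretation of KD: regressing toward the teacher is
    equivalent to training on a convex combination of the ground-truth
    future and the teacher's prediction."""
    return [lam * y + (1 - lam) * t for y, t in zip(target, teacher)]
```

For example, when the student already matches both the teacher and the ground truth, every term vanishes; as `lam` moves from 1 to 0, `mixup_target` interpolates from pure supervised learning to pure teacher imitation.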