🤖 AI Summary
This paper addresses two key challenges for time-series forecasting foundation models: weak cross-domain generalization and high intra-domain heterogeneity. It proposes a cross-modal zero-shot forecasting paradigm that eliminates the need for time-series pretraining. The core insight is an intrinsic structural similarity, both local and global, between real-world multivariate time series and natural images, which allows time-series forecasting to be reformulated as an image reconstruction task. By leveraging an ImageNet-pretrained vision masked autoencoder (ViT-MAE) through a lightweight time-series-to-image format mapping, the method achieves zero-shot performance surpassing existing time-series foundation models; with only one epoch of fine-tuning, it attains state-of-the-art results on most benchmarks. Extensive experiments demonstrate strong cross-domain generalization. The code is publicly available.
📝 Abstract
Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either repurpose large language models (LLMs) or build large-scale time series datasets to develop TSF foundation models for universal forecasting. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. This paper explores a new road to building a TSF foundation model from rich, high-quality natural images. Our key insight is that a visual masked autoencoder, pre-trained on the ImageNet dataset, can naturally be a numeric series forecaster. By reformulating TSF as an image reconstruction task, we bridge the gap between image pre-training and TSF downstream tasks. Surprisingly, without further adaptation in the time series domain, the proposed VisionTS could achieve better zero-shot forecasting performance than existing TSF foundation models. With fine-tuning for one epoch, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. Extensive experiments reveal intrinsic similarities between images and real-world time series, suggesting that visual models may offer a "free lunch" for TSF and highlighting the potential for future cross-modality research. Our code is publicly available at https://github.com/Keytoyze/VisionTS.
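To make the reformulation concrete, the sketch below shows one plausible series-to-image mapping of the kind the abstract describes: a 1D series is folded by its period into a 2D grid (columns are periods, rows are phases), normalized to pixel range, and the forecast horizon is appended as masked columns for a masked autoencoder to reconstruct. This is an illustrative assumption, not the paper's exact implementation; the function name and shapes are hypothetical.

```python
import numpy as np

def series_to_image(context: np.ndarray, period: int, horizon: int):
    """Fold a 1D series into a 2D image-like grid plus a forecast mask.

    Illustrative sketch (not VisionTS's actual code): each column holds
    one period, so row i tracks phase i across consecutive periods. The
    forecast horizon appears as extra zeroed columns on the right, with
    mask=True marking pixels a masked autoencoder would reconstruct.
    """
    # Drop a partial leading period so the series folds evenly.
    n = len(context) - len(context) % period
    grid = context[-n:].reshape(-1, period).T  # shape: (period, n_periods)

    # Per-series min-max normalization into the [0, 1] pixel range.
    lo, hi = grid.min(), grid.max()
    grid = (grid - lo) / (hi - lo + 1e-8)

    # Append masked columns covering the forecast horizon.
    extra = int(np.ceil(horizon / period))
    image = np.concatenate([grid, np.zeros((period, extra))], axis=1)
    mask = np.zeros_like(image, dtype=bool)
    mask[:, -extra:] = True  # True = pixels the MAE should fill in
    return image, mask
```

Under this mapping, a 96-step context with daily period 24 and a 24-step horizon yields a 24x5 image whose last column is masked; the reconstructed column is then de-normalized and unfolded back into the forecast.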