VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

📅 2024-08-30
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This paper addresses two key challenges facing time-series forecasting foundation models: weak cross-domain generalization and high intra-domain heterogeneity. It proposes a cross-modal zero-shot forecasting paradigm that eliminates the need for time-series pretraining. The core insight is an intrinsic structural similarity, both local and global, between real-world multivariate time series and natural images, which allows time-series forecasting to be reformulated as an image reconstruction task. By applying an ImageNet-pretrained visual masked autoencoder (ViT-MAE) through a lightweight mapping between image and time-series formats, the method achieves zero-shot performance surpassing existing time-series foundation models; with a single epoch of fine-tuning, it attains state-of-the-art results on most benchmarks. Extensive experiments demonstrate strong cross-domain generalization, and the code is publicly available.

📝 Abstract
Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either repurpose large language models (LLMs) or build large-scale time series datasets to develop TSF foundation models for universal forecasting. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. This paper explores a new road to building a TSF foundation model from rich, high-quality natural images. Our key insight is that a visual masked autoencoder, pre-trained on the ImageNet dataset, can naturally be a numeric series forecaster. By reformulating TSF as an image reconstruction task, we bridge the gap between image pre-training and TSF downstream tasks. Surprisingly, without further adaptation in the time series domain, the proposed VisionTS could achieve better zero-shot forecast performance than existing TSF foundation models. With fine-tuning for one epoch, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. Extensive experiments reveal intrinsic similarities between images and real-world time series, suggesting that visual models may offer a "free lunch" for TSF and highlighting the potential for future cross-modality research. Our code is publicly available at https://github.com/Keytoyze/VisionTS.
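To make the image-reconstruction framing concrete, below is a minimal sketch of the kind of time-series-to-image mapping the abstract alludes to. The function name, the period-based stacking, and the min-max normalization are illustrative assumptions rather than the paper's exact pipeline; the point is only that a 1-D context window can be folded period by period into a 2-D grayscale array that a ViT-MAE can consume.

```python
import numpy as np

def series_to_image(x: np.ndarray, period: int) -> np.ndarray:
    """Stack a 1-D context window into a 2-D grayscale array (illustrative only).

    x      : context window of length L
    period : dominant seasonality, e.g. 24 for hourly data with a daily cycle
    """
    n_cols = len(x) // period
    x = x[: n_cols * period]                       # drop the ragged tail for simplicity
    # Min-max normalize so values behave like pixel intensities in [0, 1].
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)
    # Rows index the phase within a period; each column is one full period.
    return x.reshape(n_cols, period).T             # shape: (period, n_cols)

# Example: two weeks of hourly data with a daily cycle.
hours = np.arange(14 * 24)
series = np.sin(2 * np.pi * hours / 24) + 0.1 * np.random.default_rng(0).normal(size=hours.size)
img = series_to_image(series, period=24)
print(img.shape)  # (24, 14) -- resize and replicate to 3 channels before feeding a ViT-MAE
```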
Problem

Research questions and friction points this paper is trying to address.

Can image pre-training be bridged to time series forecasting despite the modality gap?
Can competitive zero-shot forecasting be achieved without any adaptation in the time series domain?
Do natural images and real-world time series share intrinsic similarities that make such transfer possible?
Innovation

Methods, ideas, or system contributions that make the work stand out.

ImageNet-pretrained visual masked autoencoder (ViT-MAE) repurposed as a forecaster
TSF reformulated as an image reconstruction task
Strong zero-shot forecasting, with state-of-the-art results after one epoch of fine-tuning (see the sketch after this list)
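As referenced above, here is a hedged sketch of how zero-shot forecasting could follow from this framing: the forecast horizon occupies the masked right-hand region of the image, the pretrained model fills it in, and the reconstructed pixels are read back out as future values. The `mae_reconstruct` callable is a placeholder for an ImageNet-pretrained ViT-MAE wrapper, not an actual API from the paper's repository; a toy stand-in is included so the sketch runs end to end.

```python
import numpy as np

def forecast_by_reconstruction(context: np.ndarray, horizon: int, period: int,
                               mae_reconstruct) -> np.ndarray:
    """Zero-shot forecast via masked image reconstruction (illustrative sketch).

    mae_reconstruct : placeholder callable; given a (period, n_cols) array with
                      NaNs marking the masked future region, it returns the same
                      array with those pixels filled in (e.g. by a ViT-MAE).
    """
    total = len(context) + horizon
    n_cols = -(-total // period)                      # ceiling division
    canvas = np.full(n_cols * period, np.nan)
    canvas[: len(context)] = context
    img = canvas.reshape(n_cols, period).T            # NaN region = future to predict
    filled = mae_reconstruct(img)                     # image reconstruction step
    flat = filled.T.reshape(-1)                       # undo the stacking
    return flat[len(context): len(context) + horizon]

# Toy stand-in for the MAE: copy the last fully observed cycle into the masked region.
def copy_last_cycle(img: np.ndarray) -> np.ndarray:
    out = img.copy()
    last_full = img[:, ~np.isnan(img).any(axis=0)][:, -1]
    out[np.isnan(out)] = np.broadcast_to(last_full[:, None], img.shape)[np.isnan(out)]
    return out

history = np.sin(2 * np.pi * np.arange(7 * 24) / 24)
print(forecast_by_reconstruction(history, horizon=24, period=24,
                                 mae_reconstruct=copy_last_cycle)[:4])
```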
Mouxiang Chen
Zhejiang University
debiasing, large language model, code generation, time series
Lefei Shen
Zhejiang University
Time Series Forecasting, Deep Learning
Zhuo Li
State Street Technology (Zhejiang) Ltd
Xiaoyun Joy Wang
State Street Technology (Zhejiang) Ltd
Jianling Sun
Zhejiang University
Chenghao Liu
Salesforce Research Asia