🤖 AI Summary
This paper addresses two key challenges for time-series forecasting foundation models: weak cross-domain generalization and high intra-domain heterogeneity. It proposes a cross-modal zero-shot forecasting paradigm that eliminates the need for time-series pretraining. The core insight is an intrinsic structural similarity, both local and global, between real-world multivariate time series and natural images, which allows time-series forecasting to be reformulated as an image reconstruction task. By leveraging an ImageNet-pretrained vision masked autoencoder (ViT-MAE) through a lightweight time-series-to-image format mapping, the method achieves zero-shot performance surpassing existing time-series foundation models; with only one epoch of fine-tuning, it attains state-of-the-art results on most benchmarks. Extensive experiments demonstrate strong cross-domain generalization. The code is publicly available.
📝 Abstract
Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either repurpose large language models (LLMs) or build large-scale time series datasets to develop TSF foundation models for universal forecasting. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. This paper explores a new road to building a TSF foundation model from rich, high-quality natural images. Our key insight is that a visual masked autoencoder, pre-trained on the ImageNet dataset, can naturally be a numeric series forecaster. By reformulating TSF as an image reconstruction task, we bridge the gap between image pre-training and TSF downstream tasks. Surprisingly, without further adaptation in the time series domain, the proposed VisionTS could achieve better zero-shot forecasting performance than existing TSF foundation models. With fine-tuning for one epoch, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. Extensive experiments reveal intrinsic similarities between images and real-world time series, suggesting that visual models may offer a "free lunch" for TSF and highlighting the potential for future cross-modality research. Our code is publicly available at https://github.com/Keytoyze/VisionTS.
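To make the reformulation concrete, the sketch below shows one plausible series-to-image mapping of the kind the abstract describes: a 1D series is folded by its period into a 2D grid (columns are periods, rows are phases), normalized to pixel range, and the forecast horizon is appended as masked columns for a masked autoencoder to reconstruct. This is an illustrative assumption, not the paper's exact implementation; the function name and shapes are hypothetical.

```python
import numpy as np

def series_to_image(context: np.ndarray, period: int, horizon: int):
    """Fold a 1D series into a 2D image-like grid plus a forecast mask.

    Illustrative sketch (not VisionTS's actual code): each column holds
    one period, so row i tracks phase i across consecutive periods. The
    forecast horizon appears as extra zeroed columns on the right, with
    mask=True marking pixels a masked autoencoder would reconstruct.
    """
    # Drop a partial leading period so the series folds evenly.
    n = len(context) - len(context) % period
    grid = context[-n:].reshape(-1, period).T  # shape: (period, n_periods)

    # Per-series min-max normalization into the [0, 1] pixel range.
    lo, hi = grid.min(), grid.max()
    grid = (grid - lo) / (hi - lo + 1e-8)

    # Append masked columns covering the forecast horizon.
    extra = int(np.ceil(horizon / period))
    image = np.concatenate([grid, np.zeros((period, extra))], axis=1)
    mask = np.zeros_like(image, dtype=bool)
    mask[:, -extra:] = True  # True = pixels the MAE should fill in
    return image, mask
```

Under this mapping, a 96-step context with daily period 24 and a 24-step horizon yields a 24x5 image whose last column is masked; the reconstructed column is then de-normalized and unfolded back into the forecast.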