🤖 AI Summary
This paper addresses three key challenges in transferring vision models to time series forecasting: the modality gap between images and time series, differences in modeling multivariate dependencies, and disparities in probabilistic forecasting. To this end, we propose a cross-modal continual pre-training framework that unifies time series forecasting under an image reconstruction paradigm, eliminating the need for task-specific fine-tuning. Key contributions include: (1) a filtering mechanism built upon vision backbones that selects high-quality time series for pre-training, improving pre-training stability; (2) a color-coded conversion that renders multivariate time series as multi-subfigure RGB images, explicitly encoding inter-variate dependencies; and (3) a parallel multi-quantile reconstruction head enabling flexible distributional modeling. Evaluated on both in-distribution and out-of-distribution benchmarks, our approach achieves state-of-the-art performance, with MSE reductions of 6%-44% over specialized baselines and first-place results in 9 of 12 probabilistic forecasting settings.
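The color-coded image conversion can be pictured with a small sketch. The exact rendering used by the paper is not specified here, so the palette, subfigure layout, line-plot rasterization, and the names `colorize_multivariate`, `PALETTE`, and `height` below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Hypothetical fixed palette: one distinct RGB color per variate.
PALETTE = np.array(
    [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]],
    dtype=np.float32,
)

def colorize_multivariate(ts: np.ndarray, height: int = 64) -> np.ndarray:
    """Sketch: render a (V, T) multivariate series as V stacked subfigures,
    each a line plot drawn in that variate's color, forming one RGB image."""
    V, T = ts.shape
    subfigs = []
    for v in range(V):
        x = ts[v].astype(np.float32)
        # Scale each variate independently into [0, 1] (bounded pixel range).
        lo, hi = x.min(), x.max()
        x = (x - lo) / (hi - lo + 1e-8)
        # Map each value to a vertical pixel position (row 0 = top).
        rows = ((1.0 - x) * (height - 1)).astype(int)
        fig = np.zeros((height, T, 3), dtype=np.float32)
        fig[rows, np.arange(T)] = PALETTE[v % len(PALETTE)]
        subfigs.append(fig)
    # Stack subfigures vertically: (V * height, T, 3) RGB image.
    return np.concatenate(subfigs, axis=0)

img = colorize_multivariate(np.random.randn(3, 96))
print(img.shape)  # (192, 96, 3)
```

Because each variate keeps its own color and subfigure, a vision backbone sees inter-variate structure directly in the image, rather than receiving channels flattened away.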
📝 Abstract
Recent studies have revealed that vision models pre-trained on images can perform well in time series forecasting (TSF) by reformulating forecasting as an image reconstruction task, suggesting their potential as universal time series foundation models (TSFMs). However, effective cross-modal transfer from vision to time series remains challenging due to three key discrepancies: (1) the data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) the multivariate-forecasting gap between standard three-channel RGB vision models and the need to model time series with arbitrary numbers of variates; and (3) the probabilistic-forecasting gap between the deterministic output formats of most vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a vision-model-based TSFM that performs continual pre-training on large-scale time series datasets, featuring three innovations: (1) a vision-model-based filtering mechanism that identifies high-quality time series data, mitigating the modality gap and improving pre-training stability; (2) a colorized multivariate conversion method that transforms multivariate time series into multi-subfigure RGB images, capturing complex inter-variate dependencies; and (3) a multi-quantile forecasting approach using parallel reconstruction heads to generate forecasts at different quantile levels, thus flexibly approximating arbitrary output distributions without restrictive prior distributional assumptions. Evaluated on both in-distribution and out-of-distribution TSF benchmarks, our model achieves state-of-the-art results, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in 9 out of 12 probabilistic forecasting settings. Our work establishes a new paradigm for cross-modal knowledge transfer, advancing the development of universal TSFMs.
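Parallel quantile heads of this kind are typically trained with the quantile (pinball) loss, which penalizes under- and over-prediction asymmetrically so each head converges to a different quantile of the target distribution. A minimal sketch; the quantile levels, per-head predictions, and loss weighting below are illustrative assumptions, not the paper's exact training objective:

```python
import numpy as np

def pinball_loss(y_true: np.ndarray, y_pred: np.ndarray, q: float) -> float:
    """Quantile (pinball) loss at level q: under-prediction (y_true > y_pred)
    is weighted by q, over-prediction by (1 - q)."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# One parallel reconstruction head per quantile level (values made up).
quantiles = [0.1, 0.5, 0.9]
y_true = np.array([1.0, 2.0, 3.0])
heads = {
    0.1: np.array([0.8, 1.6, 2.4]),  # lower-bound head
    0.5: np.array([1.0, 2.0, 3.0]),  # median head
    0.9: np.array([1.2, 2.4, 3.6]),  # upper-bound head
}

# Joint objective: sum (or average) of per-quantile losses.
total = sum(pinball_loss(y_true, heads[q], q) for q in quantiles)
```

At inference, the set of quantile outputs forms a discretized predictive distribution, which is how such heads avoid committing to a fixed parametric family (e.g. Gaussian) for the forecast.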