🤖 AI Summary
This work addresses two limitations of Vision Foundation Models (VFMs): their lack of temporal modeling capability and the inherent modality gap between visual and spatiotemporal data. To bridge this gap, we propose ST-VFM, the first framework enabling zero-shot, fine-tuning-free reprogramming of VFMs for general spatiotemporal forecasting. Methodologically, ST-VFM employs a dual-branch architecture that jointly processes raw spatiotemporal sequences and optical flow dynamics. It introduces a pre- and post-VFM reprogramming mechanism, incorporating temporal-aware adapters and a bidirectional cross-modal prompting module, to enable effective cross-modal spatiotemporal representation learning while keeping the backbone (e.g., DINO, CLIP, DeiT) frozen. Extensive experiments across ten benchmark datasets demonstrate consistent superiority over state-of-the-art methods, validating ST-VFM's generality, robustness, and adaptability across diverse spatiotemporal prediction tasks.
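The dual-branch, pre-/post-VFM reprogramming flow described above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the function names, the element-wise arithmetic, and the mocked frozen backbone are all hypothetical stand-ins chosen to show the data flow (adapt both branches, pass each through the shared frozen backbone, then coordinate the branches via prompts).

```python
# Hypothetical sketch of the ST-VFM pipeline; names and math are stand-ins.

def frozen_vfm(tokens):
    """Stand-in for a frozen VFM backbone (e.g., DINO): a fixed map, never updated."""
    return [2.0 * t + 1.0 for t in tokens]

def temporal_aware_adapter(tokens, temporal_context):
    """Pre-VFM reprogramming: embed temporal context before the frozen backbone."""
    return [t + c for t, c in zip(tokens, temporal_context)]

def bilateral_cross_prompt(raw_feats, flow_feats, alpha=0.5):
    """Post-VFM reprogramming: each branch conditions the other via prompt mixing."""
    raw_out = [r + alpha * f for r, f in zip(raw_feats, flow_feats)]
    flow_out = [f + alpha * r for r, f in zip(raw_feats, flow_feats)]
    return raw_out, flow_out

def st_vfm_forward(raw_tokens, flow_tokens, temporal_context):
    # Pre-VFM reprogramming on both branches
    raw = temporal_aware_adapter(raw_tokens, temporal_context)
    flow = temporal_aware_adapter(flow_tokens, temporal_context)
    # Shared frozen backbone processes each branch separately
    raw, flow = frozen_vfm(raw), frozen_vfm(flow)
    # Post-VFM bilateral cross-prompt coordination between branches
    raw, flow = bilateral_cross_prompt(raw, flow)
    # Fuse branches into a joint representation for forecasting
    return [(r + f) / 2 for r, f in zip(raw, flow)]
```

Only the adapter and prompting modules would be trainable in practice; the backbone's weights stay untouched, which is what makes the approach fine-tuning-free with respect to the VFM itself.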
📖 Abstract
Foundation models have achieved remarkable success in natural language processing and computer vision, demonstrating strong capabilities in modeling complex patterns. While recent efforts have explored adapting large language models (LLMs) for time-series forecasting, LLMs primarily capture one-dimensional sequential dependencies and struggle to model the richer spatio-temporal (ST) correlations essential for accurate ST forecasting. In this paper, we present ST-VFM, a novel framework that systematically reprograms Vision Foundation Models (VFMs) for general-purpose spatio-temporal forecasting. While VFMs offer powerful spatial priors, two key challenges arise when applying them to ST tasks: (1) the lack of inherent temporal modeling capacity and (2) the modality gap between visual and ST data. To address these, ST-VFM adopts a dual-branch architecture that integrates raw ST inputs with auxiliary ST flow inputs, where the flow encodes lightweight temporal difference signals interpretable as dynamic spatial cues. To effectively process these dual-branch inputs, ST-VFM introduces two dedicated reprogramming stages. The pre-VFM reprogramming stage applies a Temporal-Aware Token Adapter to embed temporal context and align both branches into VFM-compatible feature spaces. The post-VFM reprogramming stage introduces a Bilateral Cross-Prompt Coordination module, enabling dynamic interaction between branches through prompt-based conditioning, thus enriching joint representation learning without modifying the frozen VFM backbone. Extensive experiments on ten spatio-temporal datasets show that ST-VFM outperforms state-of-the-art baselines across VFM backbones (e.g., DINO, CLIP, DeiT), and ablation studies confirm its effectiveness and robustness, establishing it as a strong general framework for spatio-temporal forecasting.
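The abstract describes the auxiliary "ST flow" branch as lightweight temporal difference signals. A minimal sketch of that idea, assuming the ST input is a sequence of 2-D grids (the `st_flow` name and grid shapes are illustrative, not the paper's API):

```python
# Hypothetical sketch: temporal differences between consecutive frames,
# interpretable as dynamic spatial cues for the flow branch.

def st_flow(frames):
    """Compute frame-to-frame differences for a sequence of 2-D grids.

    frames: list of T grids, each an H x W list of lists of floats.
    Returns T-1 difference grids highlighting where values change over time.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([[c - p for p, c in zip(pr, cr)]
                      for pr, cr in zip(prev, curr)])
    return diffs

# Example: two 2x2 frames (e.g., a traffic-speed grid) where one cell changes.
frames = [[[1.0, 2.0], [3.0, 4.0]],
          [[1.0, 2.5], [3.0, 4.0]]]
print(st_flow(frames))  # [[[0.0, 0.5], [0.0, 0.0]]]
```

The differences are zero everywhere except the changed cell, which is why such signals can act as cheap motion-like cues that the vision backbone can interpret spatially.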