🤖 AI Summary
This work addresses two limitations of Vision Foundation Models (VFMs): their lack of temporal modeling capability and the inherent modality gap between visual and spatiotemporal data. To bridge this gap, we propose ST-VFM, the first framework enabling zero-shot, fine-tuning-free reprogramming of VFMs for general spatiotemporal forecasting. Methodologically, ST-VFM employs a dual-branch architecture that jointly processes raw spatiotemporal sequences and optical flow dynamics. It introduces a pre- and post-VFM reprogramming mechanism, incorporating temporal-aware adapters and a bidirectional cross-modal prompting module, to enable effective cross-modal spatiotemporal representation learning while keeping the backbone (e.g., DINO, CLIP, DeiT) frozen. Extensive experiments across ten benchmark datasets demonstrate consistent superiority over state-of-the-art methods, validating ST-VFM's generality, robustness, and adaptability across diverse spatiotemporal prediction tasks.
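The dual-branch, pre-/post-VFM reprogramming flow described above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the function names, the element-wise arithmetic, and the mocked frozen backbone are all hypothetical stand-ins chosen to show the data flow (adapt both branches, pass each through the shared frozen backbone, then coordinate the branches via prompts).

```python
# Hypothetical sketch of the ST-VFM pipeline; names and math are stand-ins.

def frozen_vfm(tokens):
    """Stand-in for a frozen VFM backbone (e.g., DINO): a fixed map, never updated."""
    return [2.0 * t + 1.0 for t in tokens]

def temporal_aware_adapter(tokens, temporal_context):
    """Pre-VFM reprogramming: embed temporal context before the frozen backbone."""
    return [t + c for t, c in zip(tokens, temporal_context)]

def bilateral_cross_prompt(raw_feats, flow_feats, alpha=0.5):
    """Post-VFM reprogramming: each branch conditions the other via prompt mixing."""
    raw_out = [r + alpha * f for r, f in zip(raw_feats, flow_feats)]
    flow_out = [f + alpha * r for r, f in zip(raw_feats, flow_feats)]
    return raw_out, flow_out

def st_vfm_forward(raw_tokens, flow_tokens, temporal_context):
    # Pre-VFM reprogramming on both branches
    raw = temporal_aware_adapter(raw_tokens, temporal_context)
    flow = temporal_aware_adapter(flow_tokens, temporal_context)
    # Shared frozen backbone processes each branch separately
    raw, flow = frozen_vfm(raw), frozen_vfm(flow)
    # Post-VFM bilateral cross-prompt coordination between branches
    raw, flow = bilateral_cross_prompt(raw, flow)
    # Fuse branches into a joint representation for forecasting
    return [(r + f) / 2 for r, f in zip(raw, flow)]
```

Only the adapter and prompting modules would be trainable in practice; the backbone's weights stay untouched, which is what makes the approach fine-tuning-free with respect to the VFM itself.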
📖 Abstract
Foundation models have achieved remarkable success in natural language processing and computer vision, demonstrating strong capabilities in modeling complex patterns. While recent efforts have explored adapting large language models (LLMs) for time-series forecasting, LLMs primarily capture one-dimensional sequential dependencies and struggle to model the richer spatio-temporal (ST) correlations essential for accurate ST forecasting. In this paper, we present ST-VFM, a novel framework that systematically reprograms Vision Foundation Models (VFMs) for general-purpose spatio-temporal forecasting. While VFMs offer powerful spatial priors, two key challenges arise when applying them to ST tasks: (1) the lack of inherent temporal modeling capacity and (2) the modality gap between visual and ST data. To address these, ST-VFM adopts a dual-branch architecture that integrates raw ST inputs with auxiliary ST flow inputs, where the flow encodes lightweight temporal difference signals interpretable as dynamic spatial cues. To effectively process these dual-branch inputs, ST-VFM introduces two dedicated reprogramming stages. The pre-VFM reprogramming stage applies a Temporal-Aware Token Adapter to embed temporal context and align both branches into VFM-compatible feature spaces. The post-VFM reprogramming stage introduces a Bilateral Cross-Prompt Coordination module, enabling dynamic interaction between branches through prompt-based conditioning, thus enriching joint representation learning without modifying the frozen VFM backbone. Extensive experiments on ten spatio-temporal datasets show that ST-VFM outperforms state-of-the-art baselines across VFM backbones (e.g., DINO, CLIP, DeiT), and ablation studies confirm its effectiveness and robustness, establishing it as a strong general framework for spatio-temporal forecasting.
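The abstract describes the auxiliary "ST flow" branch as lightweight temporal difference signals. A minimal sketch of that idea, assuming the ST input is a sequence of 2-D grids (the `st_flow` name and grid shapes are illustrative, not the paper's API):

```python
# Hypothetical sketch: temporal differences between consecutive frames,
# interpretable as dynamic spatial cues for the flow branch.

def st_flow(frames):
    """Compute frame-to-frame differences for a sequence of 2-D grids.

    frames: list of T grids, each an H x W list of lists of floats.
    Returns T-1 difference grids highlighting where values change over time.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([[c - p for p, c in zip(pr, cr)]
                      for pr, cr in zip(prev, curr)])
    return diffs

# Example: two 2x2 frames (e.g., a traffic-speed grid) where one cell changes.
frames = [[[1.0, 2.0], [3.0, 4.0]],
          [[1.0, 2.5], [3.0, 4.0]]]
print(st_flow(frames))  # [[[0.0, 0.5], [0.0, 0.0]]]
```

The differences are zero everywhere except the changed cell, which is why such signals can act as cheap motion-like cues that the vision backbone can interpret spatially.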