🤖 AI Summary
This work addresses a complementary weakness in multimodal time-series forecasting: visual modalities lack semantic context, while textual modalities lack fine-grained temporal detail. The authors propose Time-VLM, presented as the first unified framework to integrate visual, linguistic, and raw time-series features. The core innovation is the incorporation of a frozen pre-trained vision-language model (VLM) into this task via three synergistic modules (retrieval-augmented modeling, temporal image encoding, and descriptive text generation), enabling cross-modal semantic alignment and complementary representation learning. Because the VLM remains frozen, no fine-tuning is required, which substantially improves few-shot and zero-shot generalization. Extensive experiments on multiple benchmark datasets demonstrate state-of-the-art performance, with an average 18.7% reduction in prediction error under few-shot and zero-shot settings.
📝 Abstract
Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments across diverse datasets demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting.
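To make the three-component pipeline concrete, the following is a minimal NumPy sketch of the overall data flow only: a retrieval step over a memory bank, a simple recurrence-style image encoding of the series, a toy text descriptor, a frozen random projection standing in for the VLM encoder, and a linear fusion head. All function names, the specific image encoding, and the embedding dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def series_to_image(x, size=16):
    # Vision-Augmented Learner (sketch): encode a 1-D series as a 2-D
    # pairwise-distance "image" (one simple choice; the paper's exact
    # encoding may differ).
    idx = np.linspace(0, len(x) - 1, size).astype(int)
    s = x[idx]
    return np.abs(s[:, None] - s[None, :])          # shape (size, size)

def retrieve(query, memory, k=2):
    # Retrieval-Augmented Learner (sketch): average the k memory-bank
    # entries most similar (cosine) to the query series.
    sims = memory @ query / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(query) + 1e-8
    )
    top = np.argsort(-sims)[:k]
    return memory[top].mean(axis=0)                 # enriched temporal features

def frozen_vlm_embed(image, text_len, dim=8):
    # Stand-in for a frozen VLM: a fixed (never-updated) random projection
    # of the image, combined with a toy feature of the text description.
    proj = np.random.default_rng(42).normal(size=(image.size, dim))  # frozen
    img_emb = image.ravel() @ proj
    txt_emb = np.full(dim, text_len / 100.0)        # toy text feature
    return img_emb + txt_emb

# Toy end-to-end pass over one series of length 64.
x = np.sin(np.linspace(0, 6, 64)) + 0.1 * rng.normal(size=64)
img = series_to_image(x)
desc = f"series of length {len(x)}, mean {x.mean():.2f}"  # Text-Augmented Learner
mm_emb = frozen_vlm_embed(img, len(desc))

memory = rng.normal(size=(5, 64))                   # memory bank of past patterns
temporal = retrieve(x, memory)

fused = np.concatenate([temporal, mm_emb])          # fuse temporal + multimodal
W = rng.normal(size=(fused.size, 8)) * 0.01         # linear prediction head
forecast = fused @ W                                # 8-step-ahead toy forecast
print(forecast.shape)
```

The key property the sketch mirrors is that only the fusion head (`W` here) would be trained; `frozen_vlm_embed` is deterministic and never updated, which is what the paper credits for the few-shot and zero-shot robustness.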