🤖 AI Summary
Existing time-series foundation models (TSFMs) predominantly adopt unimodal architectures, limiting their ability to leverage the ubiquitous multimodal contextual signals, such as visual and textual data, that accompany real-world forecasting scenarios. To address this, we propose the first unified tri-modal (time series + image + text) prompt learning framework for time-series forecasting. Our approach freezes pre-trained TSFMs alongside off-the-shelf vision and language encoders, and introduces modality-specific embeddings coupled with parameter-efficient soft prompt tuning to enable cross-modal collaborative modeling and joint inference. By preserving the generalization capability of frozen foundation models while substantially enhancing inter-modal interaction, our method achieves state-of-the-art performance across multiple mainstream time-series forecasting benchmarks. Crucially, it provides the first systematic empirical validation that multimodal contextual information significantly improves forecasting accuracy.
📝 Abstract
Time series forecasting is a foundational task across domains such as finance, healthcare, and environmental monitoring. While recent advances in Time Series Foundation Models (TSFMs) have demonstrated strong generalisation through large-scale pretraining, existing models operate predominantly in a unimodal setting, ignoring the rich multimodal context, such as visual and textual signals, that often accompanies time series data in real-world scenarios. This paper introduces UniCast, a novel parameter-efficient multimodal framework that extends TSFMs to jointly leverage time series, vision, and text modalities for enhanced forecasting performance. Our method integrates modality-specific embeddings from pretrained Vision and Text Encoders with a frozen TSFM via soft prompt tuning, enabling efficient adaptation with minimal parameter updates. This design not only preserves the generalisation strength of the foundation model but also enables effective cross-modal interaction. Extensive experiments across diverse time-series forecasting benchmarks demonstrate that UniCast consistently and significantly outperforms all existing TSFM baselines. The findings highlight the critical role of multimodal context in advancing the next generation of general-purpose time series forecasters.
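The architecture described above (a frozen backbone receiving learnable soft prompts plus modality-specific embeddings of time-series, image, and text inputs) can be illustrated with a minimal sketch. This is not the authors' implementation: the module names, dimensions, and the use of a small Transformer as a stand-in for a pretrained TSFM and for the vision/text encoder projections are all assumptions made for illustration; only the soft prompts, modality embeddings, and input/output projections are left trainable, mirroring the parameter-efficient tuning idea.

```python
import torch
import torch.nn as nn

class TriModalPromptSketch(nn.Module):
    """Illustrative sketch of tri-modal soft prompt tuning (all names/sizes are assumptions).

    A frozen Transformer stands in for a pretrained TSFM; two frozen linear
    projections stand in for pretrained vision/text encoder outputs. Only the
    soft prompts, modality embeddings, and the small input/output heads train.
    """

    def __init__(self, d_model=64, n_prompts=8, horizon=24, img_dim=128, txt_dim=96):
        super().__init__()
        # Frozen "foundation model" backbone (stand-in for a real TSFM).
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Frozen stand-ins for pretrained vision/text encoder output projections.
        self.vision_proj = nn.Linear(img_dim, d_model)
        self.text_proj = nn.Linear(txt_dim, d_model)
        for module in (self.backbone, self.vision_proj, self.text_proj):
            for p in module.parameters():
                p.requires_grad = False
        # Trainable, parameter-efficient parts.
        self.soft_prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)
        self.modality_emb = nn.Parameter(torch.randn(3, d_model) * 0.02)  # ts / img / txt
        self.ts_in = nn.Linear(1, d_model)       # embed raw time-series values
        self.head = nn.Linear(d_model, horizon)  # forecasting head

    def forward(self, ts, img_feat, txt_feat):
        # ts: (B, L, 1); img_feat: (B, img_dim); txt_feat: (B, txt_dim)
        batch = ts.size(0)
        ts_tok = self.ts_in(ts) + self.modality_emb[0]                      # (B, L, d)
        img_tok = (self.vision_proj(img_feat) + self.modality_emb[1]).unsqueeze(1)
        txt_tok = (self.text_proj(txt_feat) + self.modality_emb[2]).unsqueeze(1)
        prompts = self.soft_prompts.unsqueeze(0).expand(batch, -1, -1)      # (B, P, d)
        # Soft prompts and context tokens prepended to the time-series tokens.
        seq = torch.cat([prompts, img_tok, txt_tok, ts_tok], dim=1)
        hidden = self.backbone(seq)
        return self.head(hidden[:, -1])  # forecast from the last time-series token

model = TriModalPromptSketch()
out = model(torch.randn(2, 16, 1), torch.randn(2, 128), torch.randn(2, 96))
print(out.shape)  # (2, 24): a 24-step forecast per series
```

Because the backbone and the vision/text projections are frozen, the optimizer only ever touches the prompts, modality embeddings, and the two small linear layers, which is what makes the adaptation parameter-efficient in the sense the abstract describes.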