🤖 AI Summary
Existing time-series foundation models (TSFMs) predominantly adopt unimodal architectures, limiting their ability to leverage the ubiquitous multimodal contextual signals, such as visual and textual data, that accompany real-world forecasting scenarios. To address this, we propose the first unified tri-modal (time series + image + text) prompt learning framework for time-series forecasting. Our approach freezes pre-trained TSFMs alongside off-the-shelf vision and language encoders, and introduces modality-specific embeddings coupled with parameter-efficient soft prompt tuning to enable cross-modal collaborative modeling and joint inference. By preserving the generalization capability of frozen foundation models while substantially enhancing inter-modal interaction, our method achieves state-of-the-art performance across multiple mainstream time-series forecasting benchmarks. Crucially, it provides the first systematic empirical validation that multimodal contextual information significantly improves forecasting accuracy.
📝 Abstract
Time series forecasting is a foundational task across domains such as finance, healthcare, and environmental monitoring. While recent advances in Time Series Foundation Models (TSFMs) have demonstrated strong generalisation through large-scale pretraining, existing models operate predominantly in a unimodal setting, ignoring the rich multimodal context, such as visual and textual signals, that often accompanies time series data in real-world scenarios. This paper introduces UniCast, a novel parameter-efficient multimodal framework that extends TSFMs to jointly leverage time series, vision, and text modalities for enhanced forecasting performance. Our method integrates modality-specific embeddings from pretrained Vision and Text Encoders with a frozen TSFM via soft prompt tuning, enabling efficient adaptation with minimal parameter updates. This design not only preserves the generalisation strength of the foundation model but also enables effective cross-modal interaction. Extensive experiments across diverse time-series forecasting benchmarks demonstrate that UniCast consistently and significantly outperforms all existing TSFM baselines. The findings highlight the critical role of multimodal context in advancing the next generation of general-purpose time series forecasters.
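The architecture described above (a frozen backbone receiving learnable soft prompts plus modality-specific embeddings of time-series, image, and text inputs) can be illustrated with a minimal sketch. This is not the authors' implementation: the module names, dimensions, and the use of a small Transformer as a stand-in for a pretrained TSFM and for the vision/text encoder projections are all assumptions made for illustration; only the soft prompts, modality embeddings, and input/output projections are left trainable, mirroring the parameter-efficient tuning idea.

```python
import torch
import torch.nn as nn

class TriModalPromptSketch(nn.Module):
    """Illustrative sketch of tri-modal soft prompt tuning (all names/sizes are assumptions).

    A frozen Transformer stands in for a pretrained TSFM; two frozen linear
    projections stand in for pretrained vision/text encoder outputs. Only the
    soft prompts, modality embeddings, and the small input/output heads train.
    """

    def __init__(self, d_model=64, n_prompts=8, horizon=24, img_dim=128, txt_dim=96):
        super().__init__()
        # Frozen "foundation model" backbone (stand-in for a real TSFM).
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Frozen stand-ins for pretrained vision/text encoder output projections.
        self.vision_proj = nn.Linear(img_dim, d_model)
        self.text_proj = nn.Linear(txt_dim, d_model)
        for module in (self.backbone, self.vision_proj, self.text_proj):
            for p in module.parameters():
                p.requires_grad = False
        # Trainable, parameter-efficient parts.
        self.soft_prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)
        self.modality_emb = nn.Parameter(torch.randn(3, d_model) * 0.02)  # ts / img / txt
        self.ts_in = nn.Linear(1, d_model)       # embed raw time-series values
        self.head = nn.Linear(d_model, horizon)  # forecasting head

    def forward(self, ts, img_feat, txt_feat):
        # ts: (B, L, 1); img_feat: (B, img_dim); txt_feat: (B, txt_dim)
        batch = ts.size(0)
        ts_tok = self.ts_in(ts) + self.modality_emb[0]                      # (B, L, d)
        img_tok = (self.vision_proj(img_feat) + self.modality_emb[1]).unsqueeze(1)
        txt_tok = (self.text_proj(txt_feat) + self.modality_emb[2]).unsqueeze(1)
        prompts = self.soft_prompts.unsqueeze(0).expand(batch, -1, -1)      # (B, P, d)
        # Soft prompts and context tokens prepended to the time-series tokens.
        seq = torch.cat([prompts, img_tok, txt_tok, ts_tok], dim=1)
        hidden = self.backbone(seq)
        return self.head(hidden[:, -1])  # forecast from the last time-series token

model = TriModalPromptSketch()
out = model(torch.randn(2, 16, 1), torch.randn(2, 128), torch.randn(2, 96))
print(out.shape)  # (2, 24): a 24-step forecast per series
```

Because the backbone and the vision/text projections are frozen, the optimizer only ever touches the prompts, modality embeddings, and the two small linear layers, which is what makes the adaptation parameter-efficient in the sense the abstract describes.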