Empowering Time Series Analysis with Large-Scale Multimodal Pretraining

📅 2026-02-05
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Existing time series foundation models struggle to integrate multimodal information effectively, limiting their capacity for a comprehensive understanding of complex temporal data. To address this challenge, this work proposes the first multimodal pretraining paradigm designed specifically for time series. It introduces MM-TS, a billion-scale multimodal time series dataset, and HORAI, a frequency-enhanced cross-modal encoder-decoder architecture that fuses endogenous modalities (derived images and text) with exogenous knowledge (real-world news). The approach achieves state-of-the-art zero-shot performance on both time series forecasting and anomaly detection, significantly improving cross-domain generalization.
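To make the "frequency-enhanced cross-modal" idea concrete, here is a minimal PyTorch sketch of how a series encoder could combine time-domain patch embeddings with FFT-derived frequency features and then cross-attend to image/text/news tokens. HORAI's implementation is not described in this summary, so every class name, dimension, and fusion choice below is an assumption about the general technique, not the authors' code.

```python
# Hypothetical sketch of frequency-enhanced cross-modal fusion.
# Not HORAI's actual architecture; all names and sizes are assumptions.
import torch
import torch.nn as nn

class FrequencyEnhancedEncoder(nn.Module):
    def __init__(self, patch_len=16, d_model=128, n_heads=4):
        super().__init__()
        # Time-domain patch embedding.
        self.time_proj = nn.Linear(patch_len, d_model)
        # Frequency-domain embedding: rFFT of a patch gives
        # patch_len // 2 + 1 complex bins; real/imag parts concatenated.
        self.freq_proj = nn.Linear(2 * (patch_len // 2 + 1), d_model)
        # Cross-attention: series tokens attend to image/text/news tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, patches, modal_tokens):
        # patches: (batch, n_patches, patch_len) raw series patches
        # modal_tokens: (batch, n_modal, d_model) embedded other modalities
        spec = torch.fft.rfft(patches, dim=-1)              # complex spectrum
        freq_feat = torch.cat([spec.real, spec.imag], dim=-1)
        tokens = self.time_proj(patches) + self.freq_proj(freq_feat)
        fused, _ = self.cross_attn(tokens, modal_tokens, modal_tokens)
        return self.norm(tokens + fused)                    # residual fusion
```

For example, a batch of 64 series split into 32 patches of length 16, fused with 10 modality tokens: `FrequencyEnhancedEncoder()(torch.randn(64, 32, 16), torch.randn(64, 10, 128))` returns fused tokens of shape (64, 32, 128).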

📝 Abstract
While existing time series foundation models rely primarily on large-scale unimodal pretraining, they lack complementary modalities to enhance time series understanding. Building multimodal foundation models is a natural next step, but it faces key challenges: 1) the lack of a unified multimodal pretraining paradigm and of large-scale multimodal corpora for time series analysis; and 2) the difficulty of effectively integrating heterogeneous modalities and enhancing model generalization. To address these challenges, we take an early step toward multimodal foundation models for time series analysis. We first propose a multimodal pretraining paradigm that pairs time series with endogenous modalities (derived images and text) and exogenous knowledge (real-world news), providing a comprehensive multi-view perspective for time series analysis. To support this, we develop an automated data construction pipeline to curate MM-TS, the first large-scale multimodal time series dataset, spanning six domains with up to one billion points. We then propose HORAI, a frequency-enhanced multimodal foundation model built on two core components: a Frequency-enhanced Cross-Modality Encoder and a Time-Frequency Decoder, designed to fuse multimodal features effectively and to generalize across modalities and domains. After pretraining on MM-TS, HORAI achieves state-of-the-art zero-shot performance on time series forecasting and anomaly detection, demonstrating strong generalization.
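The endogenous modalities are derived from the series themselves. The sketch below shows one plausible form such a derivation step could take: rendering each window as a line-plot image and templating a short statistical text description. The paper's actual pipeline, rendering settings, and description template are not given here, so everything in this sketch is an assumption about the general technique.

```python
# Hypothetical sketch of deriving endogenous modalities from a raw series,
# in the spirit of the MM-TS construction pipeline. All choices are assumptions.
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

def derive_image(series: np.ndarray) -> bytes:
    """Render the series as a small line-plot PNG (derived image modality)."""
    fig, ax = plt.subplots(figsize=(2, 2), dpi=64)
    ax.plot(series, linewidth=1)
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()

def derive_text(series: np.ndarray) -> str:
    """Template simple statistics as the derived text modality."""
    trend = "rising" if series[-1] > series[0] else "falling"
    return (f"Series of {len(series)} points, mean {series.mean():.2f}, "
            f"std {series.std():.2f}, overall {trend} trend.")

# Build one multimodal sample from a synthetic series.
sample = {"series": np.sin(np.linspace(0, 6, 256)) + 0.1 * np.random.randn(256)}
sample["image"] = derive_image(sample["series"])
sample["text"] = derive_text(sample["series"])
# Exogenous news would be attached separately, e.g. by retrieval over
# the window's timestamps (not shown).
```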
Problem

Research questions and friction points this paper is trying to address.

time series analysis
multimodal pretraining
foundation models
heterogeneous modalities
model generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal pretraining
time series foundation model
frequency-enhanced architecture
large-scale dataset
zero-shot generalization