Empowering Time Series Analysis with Large-Scale Multimodal Pretraining

📅 2026-02-05
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Existing time series foundation models struggle to integrate multimodal information effectively, limiting their capacity for a comprehensive understanding of complex temporal data. To address this challenge, this work proposes the first multimodal pretraining paradigm designed specifically for time series. It introduces MM-TS, a billion-scale multimodal time series dataset, and HORAI, a frequency-enhanced cross-modal encoder-decoder architecture that fuses endogenous modalities (derived images and text) with exogenous knowledge (real-world news). The approach achieves state-of-the-art zero-shot performance on both time series forecasting and anomaly detection, significantly improving cross-domain generalization.
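To make the "frequency-enhanced cross-modal" idea concrete, here is a minimal PyTorch sketch of how a series encoder could combine time-domain patch embeddings with FFT-derived frequency features and then cross-attend to image/text/news tokens. HORAI's implementation is not described in this summary, so every class name, dimension, and fusion choice below is an assumption about the general technique, not the authors' code.

```python
# Hypothetical sketch of frequency-enhanced cross-modal fusion.
# Not HORAI's actual architecture; all names and sizes are assumptions.
import torch
import torch.nn as nn

class FrequencyEnhancedEncoder(nn.Module):
    def __init__(self, patch_len=16, d_model=128, n_heads=4):
        super().__init__()
        # Time-domain patch embedding.
        self.time_proj = nn.Linear(patch_len, d_model)
        # Frequency-domain embedding: rFFT of a patch gives
        # patch_len // 2 + 1 complex bins; real/imag parts concatenated.
        self.freq_proj = nn.Linear(2 * (patch_len // 2 + 1), d_model)
        # Cross-attention: series tokens attend to image/text/news tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, patches, modal_tokens):
        # patches: (batch, n_patches, patch_len) raw series patches
        # modal_tokens: (batch, n_modal, d_model) embedded other modalities
        spec = torch.fft.rfft(patches, dim=-1)              # complex spectrum
        freq_feat = torch.cat([spec.real, spec.imag], dim=-1)
        tokens = self.time_proj(patches) + self.freq_proj(freq_feat)
        fused, _ = self.cross_attn(tokens, modal_tokens, modal_tokens)
        return self.norm(tokens + fused)                    # residual fusion
```

For example, a batch of 64 series split into 32 patches of length 16, fused with 10 modality tokens: `FrequencyEnhancedEncoder()(torch.randn(64, 32, 16), torch.randn(64, 10, 128))` returns fused tokens of shape (64, 32, 128).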

📝 Abstract
While existing time series foundation models rely primarily on large-scale unimodal pretraining, they lack complementary modalities to enhance time series understanding. Building multimodal foundation models is a natural next step, but it faces key challenges: 1) the lack of a unified multimodal pretraining paradigm and of large-scale multimodal corpora for time series analysis; and 2) the difficulty of effectively integrating heterogeneous modalities and enhancing model generalization. To address these challenges, we take an early step toward multimodal foundation models for time series analysis. We first propose a multimodal pretraining paradigm that pairs time series with endogenous modalities (derived images and text) and exogenous knowledge (real-world news), providing a comprehensive multi-view perspective for time series analysis. To support this, we develop an automated data construction pipeline to curate MM-TS, the first large-scale multimodal time series dataset, spanning six domains with up to one billion points. We then propose HORAI, a frequency-enhanced multimodal foundation model built on two core components: a Frequency-enhanced Cross-Modality Encoder and a Time-Frequency Decoder, designed to fuse multimodal features effectively and to generalize across modalities and domains. After pretraining on MM-TS, HORAI achieves state-of-the-art zero-shot performance on time series forecasting and anomaly detection, demonstrating strong generalization.
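The endogenous modalities are derived from the series themselves. The sketch below shows one plausible form such a derivation step could take: rendering each window as a line-plot image and templating a short statistical text description. The paper's actual pipeline, rendering settings, and description template are not given here, so everything in this sketch is an assumption about the general technique.

```python
# Hypothetical sketch of deriving endogenous modalities from a raw series,
# in the spirit of the MM-TS construction pipeline. All choices are assumptions.
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

def derive_image(series: np.ndarray) -> bytes:
    """Render the series as a small line-plot PNG (derived image modality)."""
    fig, ax = plt.subplots(figsize=(2, 2), dpi=64)
    ax.plot(series, linewidth=1)
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()

def derive_text(series: np.ndarray) -> str:
    """Template simple statistics as the derived text modality."""
    trend = "rising" if series[-1] > series[0] else "falling"
    return (f"Series of {len(series)} points, mean {series.mean():.2f}, "
            f"std {series.std():.2f}, overall {trend} trend.")

# Build one multimodal sample from a synthetic series.
sample = {"series": np.sin(np.linspace(0, 6, 256)) + 0.1 * np.random.randn(256)}
sample["image"] = derive_image(sample["series"])
sample["text"] = derive_text(sample["series"])
# Exogenous news would be attached separately, e.g. by retrieval over
# the window's timestamps (not shown).
```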
Problem

Research questions and friction points this paper is trying to address.

time series analysis
multimodal pretraining
foundation models
heterogeneous modalities
model generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal pretraining
time series foundation model
frequency-enhanced architecture
large-scale dataset
zero-shot generalization