UniDiff: A Unified Diffusion Framework for Multimodal Time Series Forecasting

📅 2025-12-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing diffusion models for time series forecasting (TSF) are mostly limited to unimodal numerical sequences and struggle to integrate heterogeneous modalities such as textual descriptions and timestamps. To address this, the paper proposes UniDiff, a unified diffusion framework for multimodal TSF. First, it tokenizes the series into patches and embeds each patch with a lightweight MLP, preserving local temporal dynamics. Second, a unified, parallel fusion module uses a single cross-attention mechanism to adaptively weigh and integrate structural information from timestamps and semantic context from texts in one step. Third, a classifier-free guidance strategy designed for multi-source conditioning decouples the guidance strength of textual and temporal information at inference, improving flexibility and robustness. Experiments on real-world benchmark datasets spanning eight domains show that UniDiff achieves state-of-the-art performance.
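
The patch-then-embed step mentioned in the summary can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the patch length, stride, hidden size, ReLU activation, and the helper names `patchify` and `mlp_embed` are all assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def patchify(series, patch_len, stride):
    """Split a 1-D series into (possibly overlapping) patches of equal length."""
    n_patches = (len(series) - patch_len) // stride + 1
    return np.stack(
        [series[i * stride : i * stride + patch_len] for i in range(n_patches)]
    )

def mlp_embed(patches, w1, b1, w2, b2):
    """Two-layer MLP applied independently to every patch (ReLU is an assumption)."""
    hidden = np.maximum(patches @ w1 + b1, 0.0)
    return hidden @ w2 + b2

# Hypothetical sizes, chosen only for the example.
patch_len, stride, d_hidden, d_model = 16, 8, 32, 64

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0.0, 8.0 * np.pi, 96))  # toy lookback window

patches = patchify(series, patch_len, stride)        # shape (11, 16)

# Random, untrained weights stand in for learned parameters.
w1 = rng.normal(scale=0.1, size=(patch_len, d_hidden))
b1 = np.zeros(d_hidden)
w2 = rng.normal(scale=0.1, size=(d_hidden, d_model))
b2 = np.zeros(d_model)

tokens = mlp_embed(patches, w1, b1, w2, b2)          # shape (11, 64)
```

Each row of `tokens` corresponds to one contiguous window of the input, which is how patching keeps local temporal structure together inside a single token.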

📝 Abstract

As multimodal data proliferates across diverse real-world applications, leveraging heterogeneous information such as texts and timestamps for accurate time series forecasting (TSF) has become a critical challenge. While diffusion models demonstrate exceptional performance in generation tasks, their application to TSF remains largely confined to modeling single-modality numerical sequences, overlooking the abundant cross-modal signals inherent in complex heterogeneous data. To address this gap, we propose UniDiff, a unified diffusion framework for multimodal time series forecasting. To process the numerical sequence, our framework first tokenizes the time series into patches, preserving local temporal dynamics by mapping each patch to an embedding space via a lightweight MLP. At its core lies a unified and parallel fusion module, where a single cross-attention mechanism adaptively weighs and integrates structural information from timestamps and semantic context from texts in one step, enabling a flexible and efficient interplay between modalities. Furthermore, we introduce a novel classifier-free guidance mechanism designed for multi-source conditioning, allowing for decoupled control over the guidance strength of textual and temporal information during inference, which significantly enhances model robustness. Extensive experiments on real-world benchmark datasets across eight domains demonstrate that the proposed UniDiff model achieves state-of-the-art performance.
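
One way to read the "single cross-attention mechanism" described in the abstract is sketched below: series tokens act as queries, and timestamp and text embeddings are concatenated into one key/value context so that a single softmax weighs both modalities together. This is an interpretation under stated assumptions, not the paper's architecture; the function name `fused_cross_attention`, the single-head design, the token counts, and the random weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fused_cross_attention(series_tokens, time_tokens, text_tokens, wq, wk, wv):
    """Single-head cross-attention over a joint timestamp + text context.

    Because both modalities share one softmax, the attention weights decide
    per query how much to draw from timestamps versus text.
    """
    context = np.concatenate([time_tokens, text_tokens], axis=0)
    q = series_tokens @ wq
    k = context @ wk
    v = context @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn

# Hypothetical sizes: 11 series patches, 24 timestamp tokens, 5 text tokens.
d_model, n_series, n_time, n_text = 8, 11, 24, 5
rng = np.random.default_rng(0)

series_tokens = rng.normal(size=(n_series, d_model))
time_tokens = rng.normal(size=(n_time, d_model))
text_tokens = rng.normal(size=(n_text, d_model))

# Random, untrained projections stand in for learned parameters.
wq, wk, wv = (rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(3))

fused, attn = fused_cross_attention(
    series_tokens, time_tokens, text_tokens, wq, wk, wv
)

# Fraction of each query's attention mass that lands on the text tokens.
text_share = attn[:, n_time:].sum(axis=1)
```

Inspecting `text_share` shows how the balance between modalities varies per series token, which is the sense in which a joint softmax "adaptively weighs" the two sources in this sketch.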

Problem

Research questions and friction points this paper is trying to address.

Diffusion-based TSF is largely confined to single-modality numerical sequences and overlooks cross-modal signals such as texts and timestamps

Heterogeneous textual and temporal information must be integrated flexibly and efficiently alongside the numerical series

Guidance from multiple condition sources needs decoupled control at inference to keep forecasts robust

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified diffusion framework for multimodal time series forecasting

Cross-attention mechanism integrates timestamps and texts in parallel

Classifier-free guidance enables decoupled control over multimodal conditioning
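
Decoupled guidance over two condition sources can be sketched as follows. The source does not give the exact formula, so this uses one common compositional form of classifier-free guidance as an assumption: each condition contributes its own offset from the unconditional prediction, scaled by its own weight. The names `guided_noise`, `toy_denoiser`, `w_text`, and `w_time` are hypothetical, and the toy denoiser only stands in for a trained network.

```python
import numpy as np

def guided_noise(eps_uncond, eps_text, eps_time, w_text, w_time):
    """Combine noise predictions with a separate guidance weight per condition.

    Assumed form (not confirmed by the source):
        eps = eps_uncond
              + w_text * (eps_text - eps_uncond)
              + w_time * (eps_time - eps_uncond)
    """
    return (
        eps_uncond
        + w_text * (eps_text - eps_uncond)
        + w_time * (eps_time - eps_uncond)
    )

def toy_denoiser(x, text_emb=None, time_emb=None):
    """Stand-in for a trained network; a dropped condition is passed as None."""
    out = 0.1 * x
    if text_emb is not None:
        out = out + text_emb
    if time_emb is not None:
        out = out + time_emb
    return out

rng = np.random.default_rng(0)
x_t = rng.normal(size=24)        # noisy forecast at some diffusion step
text_emb = rng.normal(size=24)   # toy text conditioning signal
time_emb = rng.normal(size=24)   # toy timestamp conditioning signal

eps_uncond = toy_denoiser(x_t)
eps_text = toy_denoiser(x_t, text_emb=text_emb)
eps_time = toy_denoiser(x_t, time_emb=time_emb)

# Strong text guidance, weak timestamp guidance.
eps = guided_noise(eps_uncond, eps_text, eps_time, w_text=2.0, w_time=0.5)
```

Setting both weights to zero recovers the unconditional prediction, and changing one weight leaves the other condition's contribution untouched, which is what "decoupled control" means in this sketch.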

🔎 Similar Papers

A Survey on Diffusion Models for Time Series and Spatio-Temporal Data