🤖 AI Summary
This work addresses the challenge of effectively integrating causally influential textual modalities, such as policy announcements and unexpected events, with time series data, a task hindered by the difficulty of explicit modality alignment. To overcome this limitation, the authors propose a multimodal time series forecasting framework, TiMi, that eliminates the need for explicit alignment. The approach leverages a large language model (LLM) to generate causal reasoning as a guiding signal that lightly augments a Transformer-based time series model. A plug-and-play Multimodal Mixture-of-Experts (MMoE) module then fuses the heterogeneous information. Evaluated on sixteen real-world multimodal forecasting benchmarks, the method significantly outperforms state-of-the-art baselines while demonstrating strong adaptability and interpretability.
📝 Abstract
Multimodal time series forecasting has garnered significant attention for its potential to deliver more accurate predictions than traditional single-modality models by leveraging the rich information inherent in other modalities. However, due to fundamental challenges in modality alignment, existing methods often struggle to effectively incorporate multimodal data into predictions, particularly textual information that causally influences time series fluctuations, such as emergency reports and policy announcements. In this paper, we reflect on the role of textual information in numerical forecasting and propose Time series transformers with Multimodal Mixture-of-Experts, TiMi, to unleash the causal reasoning capabilities of large language models (LLMs). Concretely, TiMi uses LLMs to generate inferences about future developments, which serve as guidance for time series forecasting. To seamlessly integrate both exogenous factors and time series into predictions, we introduce a Multimodal Mixture-of-Experts (MMoE) module, a lightweight plug-in that empowers Transformer-based time series models for multimodal forecasting while eliminating the need for explicit representation-level alignment. Experimentally, TiMi achieves consistent state-of-the-art performance on sixteen real-world multimodal forecasting benchmarks, outperforming advanced baselines while offering strong adaptability and interpretability.
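To make the fusion idea concrete, the following is a minimal, hypothetical sketch of a mixture-of-experts fusion step of the kind the abstract describes: LLM-derived text features and time series features are concatenated without any representation-level alignment, a gate softly routes the joint features over experts, and the expert outputs are combined by their gate weights. All function names, layer shapes, and the plain-Python dense layers here are illustrative assumptions, not the paper's actual MMoE implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def linear(weights, bias, x):
    # Simple dense layer: y_i = sum_j w_ij * x_j + b_i
    return [sum(w * xj for w, xj in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mmoe_fuse(ts_feat, text_feat, experts, gate):
    """Hypothetical MMoE-style fusion of time series and text features.

    ts_feat, text_feat: pre-computed feature vectors (lists of floats).
    experts: list of (W, b) dense layers, each mapping the concatenated
             features to the output dimension.
    gate:    (W, b) dense layer producing one logit per expert.
    """
    joint = ts_feat + text_feat                 # concatenation, no alignment
    weights = softmax(linear(gate[0], gate[1], joint))  # soft expert routing
    outputs = [linear(W, b, joint) for W, b in experts]
    dim = len(outputs[0])
    # Gate-weighted sum of expert outputs
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(dim)]

# Toy usage: 2-dim time series features, 1-dim text feature, two experts
ts = [0.5, -1.0]
txt = [2.0]
experts = [([[1.0, 0.0, 0.0]], [0.0]),   # expert 1 reads the first ts feature
           ([[0.0, 0.0, 1.0]], [0.0])]   # expert 2 reads the text feature
gate = ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0])  # uniform gating
fused = mmoe_fuse(ts, txt, experts, gate)
```

Because the module only consumes already-extracted feature vectors, it can be attached to an existing Transformer forecaster as a plug-in, which matches the "no explicit alignment" design choice described in the abstract.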