AI Summary
This work addresses the challenge of effectively integrating textual context with multimodal time series, a task where existing methods often struggle to balance local alignment and global semantic coherence. To this end, we propose SpecTF, a novel framework that, for the first time, introduces text embeddings into the frequency domain to fuse them with spectral components of time series. SpecTF employs a lightweight cross-attention mechanism to adaptively modulate the weights of different frequency bands and subsequently maps the fused representation back to the time domain via time-frequency transformation to support downstream prediction. This approach enables multiscale contextual modeling that captures both short-term fluctuations and long-term trends. Extensive experiments demonstrate that SpecTF significantly outperforms state-of-the-art models across multiple multimodal time series benchmarks while substantially reducing model parameters.
Abstract
Multimodal time series forecasting is crucial in real-world applications, where decisions depend on both numerical data and contextual signals. The core challenge is to effectively combine temporal numerical patterns with the context embedded in other modalities, such as text. While most existing methods align textual features with time-series patterns one step at a time, they neglect the multiscale temporal influences of contextual information, such as time-series cycles and dynamic shifts. This mismatch between local alignment and global textual context can be addressed by spectral decomposition, which separates a time series into frequency components that capture both short-term changes and long-term trends. In this paper, we propose SpecTF, a simple yet effective framework that integrates the effect of textual data on time series in the frequency domain. Our method extracts textual embeddings, projects them into the frequency domain, and fuses them with the time series' spectral components using a lightweight cross-attention mechanism. This adaptively reweights frequency bands based on textual relevance before mapping the results back to the temporal domain for prediction. Experimental results demonstrate that SpecTF significantly outperforms state-of-the-art models across diverse multimodal time series datasets while using considerably fewer parameters. Code is available at https://github.com/hiepnh137/SpecTF.
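The pipeline described in the abstract (transform to the frequency domain, let frequency bins attend to text embeddings, reweight the bands, transform back) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which is available at the repository above. The function name `spectral_text_fusion`, the weight matrices `w_q`, `w_k`, `w_v`, the single-head attention, the sigmoid gate, and the use of plain linear projections for the text embeddings are all assumptions made for illustration, and the random weights stand in for learned parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spectral_text_fusion(series, text_emb, w_q, w_k, w_v):
    """Hypothetical sketch: reweight frequency bands of `series` using text.

    series:   (T,) real-valued time series
    text_emb: (N, d_text) token or sentence embeddings
    w_q:      (2, d) projects [real, imag] of each frequency bin to a query
    w_k:      (d_text, d) projects text embeddings to keys
    w_v:      (d_text, 1) projects text embeddings to scalar values
    """
    T = series.shape[0]
    spec = np.fft.rfft(series)                      # (F,) complex, F = T // 2 + 1

    # Each frequency bin becomes a query; text embeddings supply keys and values.
    freq_feats = np.stack([spec.real, spec.imag], axis=-1)    # (F, 2)
    q = freq_feats @ w_q                                      # (F, d)
    k = text_emb @ w_k                                        # (N, d)
    v = text_emb @ w_v                                        # (N, 1)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (F, N)

    # Sigmoid gate per band; scaled so a neutral gate (0.5) leaves the band unchanged.
    gate = 1.0 / (1.0 + np.exp(-(attn @ v)[:, 0]))            # (F,)
    fused = spec * (2.0 * gate)

    # Map the reweighted spectrum back to the time domain.
    return np.fft.irfft(fused, n=T), gate


# Usage with random stand-in weights (assumed shapes, not learned values).
rng = np.random.default_rng(0)
t = np.arange(64)
series = np.sin(2 * np.pi * t / 16) + 0.1 * rng.normal(size=64)
text_emb = rng.normal(size=(4, 8))
out, gate = spectral_text_fusion(
    series,
    text_emb,
    w_q=rng.normal(size=(2, 16)),
    w_k=rng.normal(size=(8, 16)),
    w_v=rng.normal(size=(8, 1)),
)
```

In this sketch the text only rescales the magnitude of each frequency band and leaves the phase untouched. That is one simple way to realize "adaptively reweights frequency bands based on textual relevance"; the actual SpecTF fusion may differ.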