Spectral Text Fusion: A Frequency-Aware Approach to Multimodal Time-Series Forecasting

📅 2026-02-02
🤖 AI Summary
This work addresses the challenge of integrating textual context into multimodal time series forecasting, a task where existing methods often struggle to balance local alignment and global semantic coherence. To this end, the authors propose SpecTF, a framework that, by their account for the first time, brings text embeddings into the frequency domain and fuses them with the spectral components of the time series. SpecTF uses a lightweight cross-attention mechanism to adaptively reweight frequency bands, then maps the fused representation back to the time domain via a time–frequency transformation to support downstream prediction. This enables multiscale contextual modeling that captures both short-term fluctuations and long-term trends. Experiments show that SpecTF significantly outperforms state-of-the-art models across multiple multimodal time series benchmarks while using substantially fewer parameters.
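
As a rough illustration of how a frequency-domain view separates short-term fluctuations from long-term trends, the following NumPy sketch splits a synthetic series into a low-frequency and a high-frequency part. It is not taken from the SpecTF code; the function name `split_bands`, the `cutoff` parameter, and the toy signal are all hypothetical choices made for this example.

```python
import numpy as np

def split_bands(x, cutoff):
    """Split a 1-D series into low- and high-frequency parts via the real FFT.

    `cutoff` is a hypothetical bin index: bins below it count as "trend".
    """
    spec = np.fft.rfft(x)
    low = spec.copy()
    low[cutoff:] = 0           # keep only the slow components
    high = spec - low          # everything else: fast components
    n = len(x)
    return np.fft.irfft(low, n=n), np.fft.irfft(high, n=n)

# Toy signal: a slow cycle (period 128) plus a fast cycle (period 8).
t = np.arange(256)
series = np.sin(2 * np.pi * t / 128) + 0.3 * np.sin(2 * np.pi * t / 8)

trend, fluct = split_bands(series, cutoff=8)
```

Because the split is linear, `trend + fluct` reconstructs the original series; a frequency-aware model can then treat the two scales differently instead of aligning text with every time step.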

๐Ÿ“ Abstract
Multimodal time series forecasting is crucial in real-world applications, where decisions depend on both numerical data and contextual signals. The core challenge is to effectively combine temporal numerical patterns with the context embedded in other modalities, such as text. While most existing methods align textual features with time-series patterns one step at a time, they neglect the multiscale temporal influences of contextual information such as time-series cycles and dynamic shifts. This mismatch between local alignment and global textual context can be addressed by spectral decomposition, which separates time series into frequency components capturing both short-term changes and long-term trends. In this paper, we propose SpecTF, a simple yet effective framework that integrates the effect of textual data on time series in the frequency domain. Our method extracts textual embeddings, projects them into the frequency domain, and fuses them with the time series'spectral components using a lightweight cross-attention mechanism. This adaptively reweights frequency bands based on textual relevance before mapping the results back to the temporal domain for predictions. Experimental results demonstrate that SpecTF significantly outperforms state-of-the-art models across diverse multi-modal time series datasets while utilizing considerably fewer parameters. Code is available at https://github.com/hiepnh137/SpecTF.
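
The pipeline described in the abstract (transform to the frequency domain, cross-attend to text, reweight bands, transform back) can be sketched as below. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the random weights stand in for learned parameters, the text embeddings are random placeholders, and the names `init_params`, `spectral_text_fusion`, `W_q`, `W_k`, `W_v`, `w_g`, as well as the `2 * sigmoid` gate, are hypothetical choices for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilise the exponent
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(d_text, d_model, rng, scale=0.1):
    # Random stand-ins for weights that a real model would learn.
    return {
        "W_q": rng.normal(0, scale, (2, d_model)),       # frequency bins -> queries
        "W_k": rng.normal(0, scale, (d_text, d_model)),  # text -> keys
        "W_v": rng.normal(0, scale, (d_text, d_model)),  # text -> values
        "w_g": rng.normal(0, scale, (d_model,)),         # context -> band gate
    }

def spectral_text_fusion(x, text_emb, p):
    """Reweight the frequency bands of `x` using attention over text tokens."""
    spec = np.fft.rfft(x)                                  # (F,) complex spectrum
    freq_tokens = np.stack([spec.real, spec.imag], -1)     # (F, 2) one token per bin
    q = freq_tokens @ p["W_q"]                             # (F, d)
    k = text_emb @ p["W_k"]                                # (T, d)
    v = text_emb @ p["W_v"]                                # (T, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))         # (F, T) bin-to-text weights
    context = attn @ v                                     # (F, d) text context per bin
    # Assumed gating form: 2 * sigmoid, so a zero logit leaves the bin unchanged.
    gate = 2.0 / (1.0 + np.exp(-(context @ p["w_g"])))     # (F,) values in (0, 2)
    fused = spec * gate                                    # reweighted spectrum
    return np.fft.irfft(fused, n=len(x)), gate             # back to the time domain

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * np.arange(96) / 24) + 0.1 * rng.normal(size=96)
text_emb = rng.normal(size=(5, 16))        # placeholder for 5 text-token embeddings
params = init_params(d_text=16, d_model=8, rng=rng)
y, gate = spectral_text_fusion(x, text_emb, params)
```

Here each frequency bin acts as a query over the text tokens, so the text influences whole frequency bands rather than individual time steps. A forecasting head would consume `y` (or the fused spectrum directly); that part is omitted here because the abstract does not specify it.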
Problem

Research questions and friction points this paper is trying to address.

multimodal time-series forecasting
text fusion
frequency-aware
spectral decomposition
contextual signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

spectral decomposition
frequency-aware fusion
multimodal time-series forecasting
cross-attention
text-time series alignment
Huu Hiep Nguyen
Applied Artificial Intelligence Initiative, Deakin University, Geelong, Australia
Minh Hoang Nguyen
Applied Artificial Intelligence Initiative, Deakin University, Geelong, Australia
Dung Nguyen
Applied Artificial Intelligence Initiative, Deakin University, Geelong, Australia
Hung Le
Research Lecturer (Assistant Professor), Deakin University
Memory-augmented Neural Networks, Memory-based Agents, Deep Learning