🤖 AI Summary
Existing multivariate time series forecasting methods often rely on fixed inductive biases, neglect inter-variable dependencies, or employ static fusion strategies, thereby struggling to capture dynamic, horizon-dependent temporal relationships. To address this, we propose a tri-modal forecasting framework that jointly models time-domain dynamics, frequency-domain characteristics, and task-specific prompt information. Specifically, we introduce a spectral branch with a gating mechanism for dual-stream feature collaboration; design an adaptive multi-head cross-modal alignment module coupled with residual fusion to dynamically modulate modality-specific weights across prediction horizons; and integrate frequency-aware positional encoding, prompt learning, and a lightweight Transformer architecture to enable effective few-shot training. Evaluated on multiple benchmark datasets, our method achieves average reductions of 3.28% in MSE and 2.29% in MAE. Notably, it maintains superior performance using only 5–10% of the full training data, demonstrating significantly enhanced generalization and long-horizon modeling capability.
📝 Abstract
Multivariate time series forecasting (MTSF) seeks to model temporal dynamics among variables to predict future trends. Transformer-based models and large language models (LLMs) have shown promise due to their ability to capture long-range dependencies and patterns. However, current methods often rely on rigid inductive biases, ignore inter-variable interactions, or apply static fusion strategies that limit adaptability across forecast horizons. These limitations create bottlenecks in capturing nuanced, horizon-specific relationships in time series data. To address this, we propose T3Time, a novel trimodal framework consisting of time, spectral, and prompt branches. The dedicated frequency encoding branch captures periodic structures, and a gating mechanism learns how to prioritize temporal versus spectral features based on the prediction horizon. We also propose a mechanism that adaptively aggregates multiple cross-modal alignment heads by dynamically weighting the importance of each head based on the features. Extensive experiments on benchmark datasets demonstrate that our model consistently outperforms state-of-the-art baselines, achieving an average reduction of 3.28% in MSE and 2.29% in MAE. Furthermore, it shows strong generalization in few-shot learning settings: with 5% of the training data, MSE and MAE are reduced by 4.13% and 1.91%, respectively, and with 10% of the data, by 3.62% and 1.98% on average. Code: https://github.com/monaf-chowdhury/T3Time/
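The two adaptive components named in the abstract, a horizon-conditioned gate between temporal and spectral features and a feature-dependent weighting of alignment heads, can be illustrated with a minimal sketch. This is not the T3Time implementation (see the repository for that). The function names, the scalar sigmoid gate, the softmax head weighting, and the parameters `w`, `b`, and `score_weights` are all assumptions chosen for illustration, standing in for whatever the model actually learns.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def horizon_gate(temporal, spectral, horizon, w=0.01, b=-1.0):
    """Blend two feature vectors with a scalar gate that depends on the horizon.

    w and b are hypothetical stand-ins for learned parameters; their values
    (and the sign of w) are arbitrary here and not taken from the paper.
    """
    g = sigmoid(w * horizon + b)  # gate in (0, 1)
    fused = [g * s + (1.0 - g) * t for t, s in zip(temporal, spectral)]
    return fused, g


def aggregate_heads(head_outputs, score_weights):
    """Combine several head outputs using weights computed from their features.

    Each head gets a score (dot product with the hypothetical score_weights),
    the scores are normalized with softmax, and the outputs are averaged
    using those normalized weights.
    """
    scores = [sum(wi * xi for wi, xi in zip(score_weights, h)) for h in head_outputs]
    alphas = softmax(scores)
    dim = len(head_outputs[0])
    fused = [sum(a * h[i] for a, h in zip(alphas, head_outputs)) for i in range(dim)]
    return fused, alphas


# Toy usage with made-up two-dimensional features.
temporal = [1.0, 0.0]
spectral = [0.0, 1.0]
fused_short, g_short = horizon_gate(temporal, spectral, horizon=96)
fused_long, g_long = horizon_gate(temporal, spectral, horizon=720)

heads = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused_heads, alphas = aggregate_heads(heads, score_weights=[2.0, 0.0])
```

In this toy setup the gate value changes with the horizon only because `w` is nonzero, and the first head receives the largest weight only because `score_weights` favors its first feature. In a trained model both behaviors would be determined by the learned parameters.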