🤖 AI Summary
This study investigates whether, and under what conditions, textual information can consistently improve time-series forecasting. Method: We systematically compare alignment-based and prompting-based multimodal approaches across 14 forecasting tasks spanning 7 domains, disentangling, for the first time, the effects of model architecture from those of data characteristics. Contribution/Results: We propose five empirically verifiable conditions for multimodal gain—(i) a high-capacity text encoder, (ii) a comparatively weak unimodal time-series baseline, (iii) a semantically appropriate text–time-series alignment strategy, (iv) sufficient training data, and (v) complementarity between the text and time-series modalities—constituting the first validity-criterion framework for multimodal time-series forecasting. Ablation-driven attribution analysis shows that multimodal methods consistently outperform the best unimodal baseline only when all five conditions hold simultaneously; otherwise, they may underperform it.
📝 Abstract
Recently, there has been growing interest in incorporating textual information into foundation models for time series forecasting. However, it remains unclear whether, and under what conditions, such multimodal integration consistently yields gains. We systematically investigate these questions across a diverse benchmark of 14 forecasting tasks spanning 7 domains, including health, environment, and economics. We evaluate two popular multimodal forecasting paradigms: alignment-based methods, which align time series and text representations, and prompting-based methods, which directly prompt large language models for forecasting. Although prior work reports gains from multimodal input, we find these effects are not universal across datasets and models, and multimodal methods sometimes fail to outperform the strongest unimodal baselines. To understand when textual information helps, we disentangle the effects of model architectural properties from those of data characteristics. Our findings highlight that, on the modeling side, incorporating text is most helpful given (1) high-capacity text models, (2) comparatively weaker time series models, and (3) appropriate alignment strategies. On the data side, performance gains are more likely when (4) sufficient training data is available and (5) the text offers complementary predictive signal beyond what is already captured from the time series alone. Our empirical findings offer practical guidelines for when multimodality can be expected to aid forecasting, and when it cannot.
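To make the prompting-based paradigm concrete, the sketch below shows one common way such methods are set up: the numeric history and accompanying text are serialized into a single prompt for a language model. This is a minimal illustration under assumptions of ours; the function name, prompt format, and example values are hypothetical and not taken from the paper.

```python
# Illustrative sketch of prompting-based forecasting: serialize a time series
# plus textual context into an LLM prompt. The prompt template is an
# assumption for illustration, not the paper's actual setup.

def build_forecast_prompt(series, horizon, context_text):
    """Combine a numeric history and text context into a forecasting prompt."""
    history = ", ".join(f"{x:.2f}" for x in series)
    return (
        f"Context: {context_text}\n"
        f"History: {history}\n"
        f"Predict the next {horizon} values, comma-separated."
    )

# Hypothetical example: a short health-domain series with a textual covariate.
prompt = build_forecast_prompt(
    [1.0, 1.5, 2.0], horizon=2, context_text="Flu season is beginning."
)
print(prompt)
```

The resulting string would then be sent to a language model, whose textual reply is parsed back into numbers; alignment-based methods instead learn joint embeddings of the two modalities rather than routing everything through a prompt.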