🤖 AI Summary
To address the semantic gap between time-series and textual modalities—where large language models (LLMs) struggle to adapt effectively for time-series forecasting—this paper proposes FiCoTS, a novel framework that repositions the LLM as a “semantic enhancer” rather than a direct predictor. FiCoTS introduces a first-of-its-kind three-level cross-modal interaction paradigm—token-level → feature-level → decision-level—spanning fine-grained to coarse-grained abstraction. It employs dynamic heterogeneous graphs to model multi-source textual dependencies, global cross-attention for precise modality alignment, and a gated fusion network for progressive information integration, thereby bridging semantic discrepancies. Extensive experiments on seven real-world benchmark datasets demonstrate statistically significant improvements over state-of-the-art methods, validating FiCoTS’s superior accuracy, robustness, and generalization capability across diverse forecasting scenarios.
📝 Abstract
Time series forecasting is central to data analysis and web technologies. The recent success of Large Language Models (LLMs) offers significant potential for this field, especially from the cross-modality perspective. Most methods adopt an LLM-as-Predictor paradigm, using the LLM as the forecasting backbone and designing modality alignment mechanisms so that the LLM can understand time series data. However, the semantic information in the time series and text modalities differs significantly, making it challenging for the LLM to fully understand time series data. To mitigate this challenge, our work follows an LLM-as-Enhancer paradigm that fully exploits the LLM's strength in text understanding: the LLM is used only to encode the text modality, which complements the time series modality. Based on this paradigm, we propose FiCoTS, an LLM-enhanced fine-to-coarse framework for multimodal time series forecasting. Specifically, the framework facilitates progressive cross-modality interaction at three levels in a fine-to-coarse scheme: First, in the token-level modality alignment module, a dynamic heterogeneous graph is constructed to filter noise and align time series patches with text tokens; Second, in the feature-level modality interaction module, a global cross-attention mechanism is introduced to enable each time series variable to connect with relevant textual contexts; Third, in the decision-level modality fusion module, we design a gated network to adaptively fuse the results of the two modalities for robust predictions. These three modules work synergistically to let the two modalities interact comprehensively across three semantic levels, enabling textual information to effectively support temporal prediction. Extensive experiments on seven real-world benchmarks demonstrate that our model achieves state-of-the-art performance. The code will be released publicly.
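To make the two coarser-grained mechanisms in the abstract concrete, below is a minimal NumPy sketch of (a) feature-level global cross-attention, where time-series variables attend to LLM-encoded text tokens, and (b) decision-level gated fusion of the two modalities' predictions. All dimensions, weight shapes, the single-head attention formulation, and the per-variable gate are illustrative assumptions for exposition, not the paper's actual FiCoTS implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 16          # shared hidden dimension (assumed)
n_vars = 3      # number of time-series variables
n_tokens = 5    # number of text tokens from the (frozen) LLM encoder

ts_feat = rng.normal(size=(n_vars, d))     # time-series variable features
txt_feat = rng.normal(size=(n_tokens, d))  # LLM-encoded text-token features

# (a) Feature-level interaction: scaled dot-product cross-attention
# with queries from the time series and keys/values from the text,
# so each variable gathers its relevant textual context.
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))
Q, K, V = ts_feat @ Wq, txt_feat @ Wk, txt_feat @ Wv
attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (n_vars, n_tokens)
txt_context = attn @ V                         # text context per variable

# (b) Decision-level fusion: a gate in (0, 1), computed from both
# modalities' features, blends the two branches' predictions.
pred_ts = rng.normal(size=(n_vars,))   # time-series branch prediction
pred_txt = rng.normal(size=(n_vars,))  # text-enhanced branch prediction
Wg = rng.normal(size=(2 * d,))
gate = sigmoid(np.concatenate([ts_feat, txt_context], axis=-1) @ Wg)
fused = gate * pred_ts + (1.0 - gate) * pred_txt  # (n_vars,)
```

In this sketch the gate is scalar per variable; a learned network (as in the abstract's gated fusion module) would produce it from trained parameters rather than random weights, but the blending arithmetic is the same.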