🤖 AI Summary
Time-series forecasting struggles to effectively fuse historical numeric sequences with unstructured textual context, particularly in the energy, healthcare, and finance domains. To address this, we propose TokenCast, the first framework to leverage large language models (LLMs) as cross-modal bridges: it discretizes continuous time series into symbolic temporal tokens, then jointly embeds them with textual context into a unified semantic space for end-to-end multimodal modeling. Our approach comprises a learnable discrete tokenizer, frozen LLM embeddings with targeted fine-tuning, autoregressive generative training, and inverse decoding back to the numeric space. Evaluated on multiple real-world benchmark datasets with authentic textual context, TokenCast consistently outperforms state-of-the-art methods, demonstrating strong generalization and practical efficacy. This work establishes a novel paradigm for multimodal time-series modeling, unifying numeric and linguistic modalities through LLM-based semantic alignment.
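To make the discretization step concrete, below is a minimal PyTorch sketch. The summary names a learnable discrete tokenizer but not its internals, so the sketch assumes a vector-quantization design over fixed-length patches; the class name, patch length, and codebook size are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class VQTokenizer(nn.Module):
    """Hypothetical discrete tokenizer: maps patches of a continuous series
    to codebook indices ("temporal tokens") and back. Vector quantization is
    one common realization, assumed here for illustration."""

    def __init__(self, patch_len=16, dim=64, codebook_size=512):
        super().__init__()
        self.patch_len = patch_len
        self.encoder = nn.Linear(patch_len, dim)          # patch -> latent
        self.codebook = nn.Embedding(codebook_size, dim)  # learnable code vectors
        self.decoder = nn.Linear(dim, patch_len)          # latent -> patch

    def tokenize(self, series):
        # series: (batch, length); split into non-overlapping patches
        patches = series.unfold(-1, self.patch_len, self.patch_len)
        z = self.encoder(patches)                         # (batch, n_patches, dim)
        # nearest codebook entry gives the discrete temporal-token ids
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        return dists.argmin(-1)                           # (batch, n_patches)

    def detokenize(self, token_ids):
        # inverse decoding: token ids -> code vectors -> numeric patches
        z_q = self.codebook(token_ids)                    # (batch, n_patches, dim)
        return self.decoder(z_q).flatten(-2)              # reassembled series
```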
📝 Abstract
Time series forecasting plays a vital role in supporting decision-making across a wide range of critical applications, including energy, healthcare, and finance. Despite recent advances, forecasting accuracy remains limited due to the challenge of integrating historical numerical sequences with contextual features, which often comprise unstructured textual data. To address this challenge, we propose TokenCast, an LLM-driven framework that leverages language-based symbolic representations as a unified intermediary for context-aware time series forecasting. Specifically, TokenCast employs a discrete tokenizer to transform continuous numerical sequences into temporal tokens, enabling structural alignment with language-based inputs. To bridge the semantic gap between modalities, both temporal and contextual tokens are embedded into a shared representation space via a pre-trained large language model (LLM), which is further optimized with autoregressive generative objectives. Building upon this unified semantic space, the aligned LLM is subsequently fine-tuned in a supervised manner to predict future temporal tokens, which are then decoded back into the original numerical space. Extensive experiments on diverse real-world datasets enriched with contextual features demonstrate the effectiveness and generalizability of TokenCast.
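The abstract's forecasting loop (tokenize the history, embed it jointly with text, generate future temporal tokens autoregressively, decode back to values) can likewise be sketched. The snippet below is a hypothetical wiring using the Hugging Face transformers API with a GPT-2 backbone; the backbone choice, vocabulary-extension scheme, token-id offset, and the `vq` tokenizer from the sketch above are all assumptions, not the paper's actual configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

text_tok = AutoTokenizer.from_pretrained("gpt2")   # backbone choice is assumed
llm = AutoModelForCausalLM.from_pretrained("gpt2")

K = 512                                  # codebook size of the discrete tokenizer
base = len(text_tok)                     # temporal tokens get ids [base, base + K)
llm.resize_token_embeddings(base + K)    # append K embedding rows for them

@torch.no_grad()
def forecast(context_text, history, vq, horizon_tokens=8):
    # 1) discretize the numeric history into temporal token ids,
    #    shifted into the extended LLM vocabulary
    temporal_ids = vq.tokenize(history) + base
    # 2) prepend the textual context so both modalities share one sequence
    text_ids = text_tok(context_text, return_tensors="pt").input_ids
    prompt = torch.cat([text_ids, temporal_ids], dim=-1)
    # 3) autoregressively generate future temporal tokens
    out = llm.generate(prompt, max_new_tokens=horizon_tokens, do_sample=False)
    # clamp is a crude stand-in for constraining generation to temporal tokens
    future = (out[:, prompt.size(-1):] - base).clamp(0, K - 1)
    # 4) inverse-decode the generated tokens back into the numeric space
    return vq.detokenize(future)
```

In practice, the generative pre-training and supervised fine-tuning stages described in the abstract would train this stack end-to-end, and decoding would be restricted to the temporal-token range rather than clamped after the fact.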