🤖 AI Summary
Existing time-series forecasting methods struggle to effectively integrate natural-language contextual information, and no prior benchmark enforces textual context as indispensable for accurate prediction. Method: The authors introduce CiK ("Context is Key"), a benchmark that pairs numerical time series with diverse, carefully crafted textual context—background knowledge, constraints, and domain information—such that every task requires understanding the text to be solved correctly. They evaluate statistical models, time-series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method. Contribution/Results: The proposed prompting method outperforms all other tested approaches on CiK. Empirical analysis highlights the importance of incorporating contextual information, reveals surprisingly strong performance from LLM-based forecasters alongside critical shortcomings in cross-modal temporal reasoning, and motivates multimodal forecasting models that are both accurate and accessible to decision-makers.
📝 Abstract
Forecasting is a critical task in decision-making across numerous domains. While historical numerical data provide a start, they fail to convey the complete context for reliable and accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge and constraints, which can efficiently be communicated through natural language. However, in spite of recent progress with LLM-based forecasters, their ability to effectively integrate this textual information remains an open question. To address this, we introduce "Context is Key" (CiK), a time-series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities; crucially, every task in CiK requires understanding textual context to be solved successfully. We evaluate a range of approaches, including statistical models, time-series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. This benchmark aims to advance multimodal forecasting by promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at https://anon-forecast.github.io/benchmark_report_dev/.