🤖 AI Summary
This work addresses the prevailing focus on numerical accuracy in time-series forecasting, which overlooks the evaluation of models' reasoning capabilities, particularly their ability to interpret cross-channel dependencies, trends, and external events. To bridge this gap, the study introduces the first benchmark specifically designed to assess the reasoning capacity of forecasting systems. It leverages a multi-agent collaborative framework with an iterative verification mechanism to generate interpretable reasoning traces, and it incorporates an LLM-as-a-Judge evaluation protocol alongside reasoning-prompt enhancement techniques. Experiments across ten diverse datasets show that the proposed approach significantly improves the forecasting accuracy of large language models (from 40.2% to 56.6%), providing the first empirical evidence of a causal effect of reasoning quality on predictive performance and revealing a widespread deficiency in numerical reasoning among off-the-shelf large models.
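The LLM-as-a-Judge protocol mentioned above can be pictured as scoring a generated reasoning trace against a reference rubric. Below is a minimal, purely illustrative sketch: the function name `judge_score` and the keyword-overlap scoring are assumptions for demonstration, not TFRBench's actual protocol, and a real judge would be an LLM call rather than a string check.

```python
def judge_score(trace, reference_aspects):
    # Stub judge: score a reasoning trace by the fraction of reference
    # rubric aspects (e.g., trends, events, cross-channel dependencies)
    # that the trace mentions. A real protocol would prompt an LLM judge.
    hits = sum(1 for aspect in reference_aspects if aspect in trace.lower())
    return hits / len(reference_aspects)

trace = "The upward trend follows a holiday event; channel A leads channel B."
print(judge_score(trace, ["trend", "event", "channel"]))  # 1.0
```

The keyword stub only stands in for the structure of the protocol (trace in, rubric-based score out); the paper's judge presumably evaluates semantic quality, not surface matches.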
📝 Abstract
We introduce TFRBench, the first benchmark designed to evaluate the reasoning capabilities of forecasting systems. Traditionally, time-series forecasting has been evaluated solely on numerical accuracy, treating foundation models as "black boxes." Unlike existing benchmarks, TFRBench provides a protocol for evaluating the reasoning generated by forecasting systems, specifically their analysis of cross-channel dependencies, trends, and external events. To enable this, we propose a systematic multi-agent framework that uses an iterative verification loop to synthesize numerically grounded reasoning traces. Spanning ten datasets across five domains, our evaluation confirms that this reasoning is causally effective and useful for evaluation: prompting LLMs with our generated traces significantly improves forecasting accuracy over direct numerical prediction (e.g., avg. ${\sim}40.2\% \to 56.6\%$), validating the quality of our reasoning. Conversely, benchmarking experiments reveal that off-the-shelf LLMs consistently struggle with both reasoning (lower LLM-as-a-Judge scores) and numerical forecasting, frequently failing to capture domain-specific dynamics. TFRBench thus establishes a new standard for interpretable, reasoning-based evaluation in time-series forecasting. Our benchmark is available at: https://tfrbench.github.io
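The iterative verification loop described in the abstract (agents generate a reasoning trace, a verifier checks it against the numbers, and feedback drives revision) can be sketched as follows. All names here (`generate_trace`, `verify_trace`, `synthesize_reasoning`) are hypothetical, and the agent calls are replaced by trivial stubs; this illustrates the control flow only, not the authors' implementation.

```python
def generate_trace(series, feedback=None):
    # Stub "writer" agent: produce a reasoning trace for the series,
    # optionally revising based on verifier feedback.
    trend = "up" if series[-1] >= series[0] else "down"
    note = f" (revised: {feedback})" if feedback else ""
    return f"trend={trend}{note}"

def verify_trace(series, trace):
    # Stub "verifier" agent: check that the stated trend is numerically
    # grounded in the data; return (ok, feedback).
    expected = "up" if series[-1] >= series[0] else "down"
    ok = f"trend={expected}" in trace
    return ok, None if ok else f"expected trend={expected}"

def synthesize_reasoning(series, max_rounds=3):
    # Generate-verify loop: iterate until the trace passes verification
    # or the round budget is exhausted.
    feedback = None
    trace = None
    for _ in range(max_rounds):
        trace = generate_trace(series, feedback)
        ok, feedback = verify_trace(series, trace)
        if ok:
            return trace
    return trace  # last attempt, even if unverified

print(synthesize_reasoning([1.0, 2.0, 3.5]))  # trend=up
```

In the actual framework the writer and verifier would be LLM agents and the check would cover the full rubric (dependencies, events, magnitudes), but the loop structure is the same.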