🤖 AI Summary
This work addresses the prevailing focus on numerical accuracy in time-series forecasting, which overlooks the evaluation of models' reasoning capabilities, particularly their ability to interpret cross-channel dependencies, trends, and external events. To bridge this gap, the study introduces the first benchmark specifically designed to assess the reasoning capacity of forecasting systems. It leverages a multi-agent collaborative framework with an iterative verification mechanism to generate interpretable reasoning traces, and it incorporates an LLM-as-a-Judge evaluation protocol alongside reasoning-prompt enhancement techniques. Experiments across ten diverse datasets show that the proposed approach significantly improves the forecasting accuracy of large language models (from 40.2% to 56.6%), providing the first empirical evidence of a causal effect of reasoning quality on predictive performance and revealing a widespread deficiency in numerical reasoning among off-the-shelf large models.
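The LLM-as-a-Judge protocol mentioned above can be pictured as scoring a generated reasoning trace against a reference rubric. Below is a minimal, purely illustrative sketch: the function name `judge_score` and the keyword-overlap scoring are assumptions for demonstration, not TFRBench's actual protocol, and a real judge would be an LLM call rather than a string check.

```python
def judge_score(trace, reference_aspects):
    # Stub judge: score a reasoning trace by the fraction of reference
    # rubric aspects (e.g., trends, events, cross-channel dependencies)
    # that the trace mentions. A real protocol would prompt an LLM judge.
    hits = sum(1 for aspect in reference_aspects if aspect in trace.lower())
    return hits / len(reference_aspects)

trace = "The upward trend follows a holiday event; channel A leads channel B."
print(judge_score(trace, ["trend", "event", "channel"]))  # 1.0
```

The keyword stub only stands in for the structure of the protocol (trace in, rubric-based score out); the paper's judge presumably evaluates semantic quality, not surface matches.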
📝 Abstract
We introduce TFRBench, the first benchmark designed to evaluate the reasoning capabilities of forecasting systems. Traditionally, time-series forecasting has been evaluated solely on numerical accuracy, treating foundation models as "black boxes." Unlike existing benchmarks, TFRBench provides a protocol for evaluating the reasoning generated by forecasting systems, specifically their analysis of cross-channel dependencies, trends, and external events. To enable this, we propose a systematic multi-agent framework that uses an iterative verification loop to synthesize numerically grounded reasoning traces. Spanning ten datasets across five domains, our evaluation confirms that this reasoning is causally effective and useful for evaluation: prompting LLMs with our generated traces significantly improves forecasting accuracy over direct numerical prediction (e.g., avg. ${\sim}40.2\% \to 56.6\%$), validating the quality of our reasoning. Conversely, benchmarking experiments reveal that off-the-shelf LLMs consistently struggle with both reasoning (lower LLM-as-a-Judge scores) and numerical forecasting, frequently failing to capture domain-specific dynamics. TFRBench thus establishes a new standard for interpretable, reasoning-based evaluation in time-series forecasting. Our benchmark is available at: https://tfrbench.github.io
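The iterative verification loop described in the abstract (agents generate a reasoning trace, a verifier checks it against the numbers, and feedback drives revision) can be sketched as follows. All names here (`generate_trace`, `verify_trace`, `synthesize_reasoning`) are hypothetical, and the agent calls are replaced by trivial stubs; this illustrates the control flow only, not the authors' implementation.

```python
def generate_trace(series, feedback=None):
    # Stub "writer" agent: produce a reasoning trace for the series,
    # optionally revising based on verifier feedback.
    trend = "up" if series[-1] >= series[0] else "down"
    note = f" (revised: {feedback})" if feedback else ""
    return f"trend={trend}{note}"

def verify_trace(series, trace):
    # Stub "verifier" agent: check that the stated trend is numerically
    # grounded in the data; return (ok, feedback).
    expected = "up" if series[-1] >= series[0] else "down"
    ok = f"trend={expected}" in trace
    return ok, None if ok else f"expected trend={expected}"

def synthesize_reasoning(series, max_rounds=3):
    # Generate-verify loop: iterate until the trace passes verification
    # or the round budget is exhausted.
    feedback = None
    trace = None
    for _ in range(max_rounds):
        trace = generate_trace(series, feedback)
        ok, feedback = verify_trace(series, trace)
        if ok:
            return trace
    return trace  # last attempt, even if unverified

print(synthesize_reasoning([1.0, 2.0, 3.5]))  # trend=up
```

In the actual framework the writer and verifier would be LLM agents and the check would cover the full rubric (dependencies, events, magnitudes), but the loop structure is the same.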