🤖 AI Summary
Existing time series forecasting benchmarks typically assume that the relevant context is already provided, so they cannot measure whether an agent can discover predictive evidence on its own from noisy, heterogeneous documents. To address this limitation, this work proposes Dr-CiK, a benchmark for context discovery in forecasting: agents must retrieve relevant evidence from a document corpus, filter out distractors, distill what remains into forecast-useful evidence, and generate forecasts from it. The evaluation combines context ablations with end-to-end tests that pair deep research (DR) agents with forecasting methods. Experiments show that high-quality context substantially improves forecasting performance; however, most current DR agents recover only a small fraction of the ground-truth evidence (usually under 5%), more than 80% of their citations are distractors, and the retrieved context can leave forecasters performing worse than with no context at all.
📝 Abstract
Time series forecasting in real-world settings often depends not only on historical observations but also on external context that must be actively discovered from noisy, heterogeneous information sources. Yet existing context-aided forecasting benchmarks typically assume that the supporting context is already provided, leaving open whether agents can identify it on their own. To address this gap, we introduce Dr-CiK, a benchmark for evaluating whether agents can retrieve forecasting-relevant supporting context from a document corpus, filter out distractors, distill the retrieved context into forecast-useful evidence, and generate forecasts supported by that evidence. Through context ablations and evaluations that pair state-of-the-art deep research (DR) agents with forecasting methods, we show that high-quality context substantially improves forecasting performance in Dr-CiK. However, most existing DR agents recover only a small fraction of the ground-truth supporting evidence (usually <5%), are frequently misled by distractors (>80% of their citations are distractors), and can cause forecasters to perform worse with retrieved context than without context. Our results motivate research on foresight-driven agents that search for the right context to predict the future.
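To make the two retrieval figures quoted above concrete, here is a minimal sketch of how an evidence-recall rate and a distractor-citation rate could be computed for one task. The abstract does not give the benchmark's exact metric definitions, so everything here is an assumption for illustration: the function name `retrieval_metrics`, the document identifiers, and the choice to count every cited document outside the ground-truth set as a distractor.

```python
from typing import Dict, Iterable


def retrieval_metrics(cited: Iterable[str], ground_truth: Iterable[str]) -> Dict[str, float]:
    """Hypothetical per-task metrics; the benchmark's own definitions may differ.

    evidence_recall: share of ground-truth evidence documents the agent cited.
    distractor_rate: share of the agent's citations that are not ground truth
                     (assumption: any non-ground-truth citation is a distractor).
    """
    cited_set = set(cited)
    truth_set = set(ground_truth)

    hits = cited_set & truth_set
    # Guard against empty sets so the ratios are always defined.
    recall = len(hits) / len(truth_set) if truth_set else 0.0
    distractor_rate = len(cited_set - truth_set) / len(cited_set) if cited_set else 0.0

    return {"evidence_recall": recall, "distractor_rate": distractor_rate}


# Made-up identifiers: the agent cites five documents, one of which is
# among the four ground-truth evidence documents.
example = retrieval_metrics(
    cited=["doc_a", "doc_x", "doc_y", "doc_z", "doc_w"],
    ground_truth=["doc_a", "doc_b", "doc_c", "doc_d"],
)
print(example)
```

In this toy case the agent finds one of four evidence documents while four of its five citations are distractors. The two quantities are independent of each other: an agent can cite few distractors and still miss most of the evidence, which is why the abstract reports both.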