🤖 AI Summary
Existing time-series forecasting benchmarks suffer from narrow domain coverage, neglect of covariate-aware tasks, statistically unsound aggregation, inadequate infrastructure, and poor pipeline integration. To address these limitations, we introduce fev-bench, a realistic benchmark spanning seven application domains and comprising 100 diverse forecasting tasks (46 with exogenous covariates), enabling comprehensive evaluation across modeling scenarios. We propose a principled aggregation method grounded in bootstrap confidence intervals, supporting statistically robust assessment along two complementary dimensions: win rates and skill scores. Alongside the benchmark, we release fev, a lightweight, open-source Python library, to enhance reproducibility and streamline integration with existing workflows. Extensive evaluation across pretrained, statistical, and baseline models reveals pronounced task-specific performance variability, underscoring the necessity of fine-grained, domain-aware evaluation. fev-bench establishes a reliable, scalable, and extensible standard for future forecasting research and development.
📝 Abstract
Benchmark quality is critical for meaningful evaluation and sustained progress in time series forecasting, particularly given the recent rise of pretrained models. Existing benchmarks often have narrow domain coverage or overlook important real-world settings, such as tasks with covariates. Additionally, their aggregation procedures often lack statistical rigor, making it unclear whether observed performance differences reflect true improvements or random variation. Many benchmarks also fail to provide infrastructure for consistent evaluation or are too rigid to integrate into existing pipelines. To address these gaps, we propose fev-bench, a benchmark comprising 100 forecasting tasks across seven domains, including 46 tasks with covariates. Supporting the benchmark, we introduce fev, a lightweight Python library for benchmarking forecasting models that emphasizes reproducibility and seamless integration with existing workflows. Using fev, fev-bench employs principled aggregation methods with bootstrapped confidence intervals to report model performance along two complementary dimensions: win rates and skill scores. We report results on fev-bench for various pretrained, statistical, and baseline models, and identify promising directions for future research.
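The abstract's two aggregation dimensions, win rates and skill scores with bootstrapped confidence intervals, can be sketched generically. The snippet below is an illustrative assumption, not fev's actual API: the error arrays, metric definitions (win rate as the fraction of tasks where a model beats a baseline; skill score as one minus the geometric mean of per-task error ratios), and function names are all hypothetical, with task-level resampling providing the percentile bootstrap interval.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-task errors (e.g. a scaled error such as MASE) for a
# candidate model and a baseline, one entry per benchmark task.
model_errors = rng.uniform(0.5, 1.5, size=100)
baseline_errors = rng.uniform(0.8, 1.2, size=100)

def win_rate(model, baseline):
    """Fraction of tasks on which the model strictly beats the baseline."""
    return float(np.mean(model < baseline))

def skill_score(model, baseline):
    """1 minus the geometric mean of per-task error ratios (higher is better)."""
    ratios = model / baseline
    return float(1.0 - np.exp(np.mean(np.log(ratios))))

def bootstrap_ci(metric, model, baseline, n_boot=1000, alpha=0.05):
    """Percentile bootstrap CI from resampling tasks with replacement."""
    n = len(model)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample task indices
        stats.append(metric(model[idx], baseline[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return metric(model, baseline), (lo, hi)

wr, wr_ci = bootstrap_ci(win_rate, model_errors, baseline_errors)
ss, ss_ci = bootstrap_ci(skill_score, model_errors, baseline_errors)
print(f"win rate:    {wr:.3f}  95% CI [{wr_ci[0]:.3f}, {wr_ci[1]:.3f}]")
print(f"skill score: {ss:.3f}  95% CI [{ss_ci[0]:.3f}, {ss_ci[1]:.3f}]")
```

Resampling at the task level (rather than the time-series level) treats each task as the unit of evidence, so the interval reflects how much the aggregate would move if the benchmark had drawn a different set of tasks.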