AI Summary
This work addresses the absence of a unified, comprehensive, and community-recognized evaluation framework for time-series foundation models (TSFMs). Existing benchmarks often rely on outdated data, cover a narrow range of tasks, tune hyperparameters inconsistently across models, and offer no visualization tools. To bridge this gap, the authors introduce TempusBench, an open-source evaluation framework that combines new datasets that do not overlap with existing TSFM pretraining corpora; forecasting tasks that capture statistical properties such as non-stationarity and seasonality; a standardized hyperparameter tuning protocol; and a TensorBoard-based visualization interface. The framework is designed to address data recency, task diversity, tuning fairness, and result interpretability together, supporting fair, fine-grained, and reproducible comparisons between domain-specific models and foundation models while giving the community an extensible evaluation infrastructure.
Abstract
Foundation models have transformed natural language processing and computer vision, and a rapidly growing literature on time-series foundation models (TSFMs) seeks to replicate this success in forecasting. While recent open-source models demonstrate the promise of TSFMs, the field lacks a comprehensive and community-accepted model evaluation framework. We see at least four major issues impeding progress on the development of such a framework. First, current evaluation frameworks consist of benchmark forecasting tasks derived from often outdated datasets (e.g., M3), many of which lack clear metadata and overlap with the corpora used to pre-train TSFMs. Second, existing frameworks evaluate models along a narrow set of dimensions, such as forecast horizon length or domain, but overlook core statistical properties such as non-stationarity and seasonality. Third, domain-specific models (e.g., XGBoost) are often compared unfairly, as existing frameworks neglect a systematic and consistent hyperparameter tuning convention for all models. Fourth, visualization tools for interpreting comparative performance are lacking. To address these issues, we introduce TempusBench, an open-source evaluation framework for TSFMs. TempusBench consists of 1) new datasets that are not included in existing TSFM pretraining corpora, 2) a set of novel benchmark tasks that go beyond existing ones, 3) a model evaluation pipeline with a standardized hyperparameter tuning protocol, and 4) a TensorBoard-based visualization interface. We provide access to our code on GitHub: https://github.com/Smlcrm/TempusBench.