It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

📅 2026-02-12
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Existing time series forecasting benchmarks exhibit limitations in data composition, task formulation, and evaluation perspective, hindering a comprehensive assessment of foundation models' generalization capabilities. To address these limitations, this work proposes TIME, a new benchmark comprising 50 high-quality datasets and 98 forecasting tasks. TIME employs a human-in-the-loop pipeline, integrating large language models with expert knowledge, to rigorously prevent data leakage; introduces realistic task definitions grounded in practical scenarios; and adopts a pattern-level evaluation framework based on structural characteristics of the series. Under a strict zero-shot setting, the benchmark conducts multi-granularity evaluations of 12 representative time series foundation models and releases an interactive leaderboard to provide generalizable insights into model performance and capabilities.

📝 Abstract
Time series foundation models (TSFMs) are revolutionizing the forecasting landscape from specific dataset modeling to generalizable task evaluation. However, we contend that existing benchmarks exhibit common limitations in four dimensions: constrained data composition dominated by reused legacy sources, compromised data integrity lacking rigorous quality assurance, misaligned task formulations detached from real-world contexts, and rigid analysis perspectives that obscure generalizable insights. To bridge these gaps, we introduce TIME, a next-generation task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks, tailored for strict zero-shot TSFM evaluation free from data leakage. Integrating large language models and human expertise, we establish a rigorous human-in-the-loop benchmark construction pipeline to ensure high data integrity and redefine task formulation by aligning forecasting configurations with real-world operational requirements and variate predictability. Furthermore, we propose a novel pattern-level evaluation perspective that moves beyond traditional dataset-level evaluations based on static meta labels. By leveraging structural time series features to characterize intrinsic temporal properties, this approach offers generalizable insights into model capabilities across diverse patterns. We evaluate 12 representative TSFMs and establish a multi-granular leaderboard to facilitate in-depth analysis and visualized inspection. The leaderboard is available at https://huggingface.co/spaces/Real-TSF/TIME-leaderboard.
Problem

Research questions and friction points this paper is trying to address.

time series forecasting
benchmark limitations
data integrity
task formulation
evaluation perspective
Innovation

Methods, ideas, or system contributions that make the work stand out.

time series foundation models
zero-shot evaluation
human-in-the-loop benchmark
pattern-level evaluation
forecasting benchmark