Impermanent: A Live Benchmark for Temporal Generalization in Time Series Forecasting

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current time series forecasting evaluation relies on static data splits, which are prone to test leakage and performance overestimation and fail to capture models' temporal generalization under dynamic conditions. This work proposes the first open-world dynamic evaluation paradigm for time series forecasting: daily rolling evaluations on continuously evolving streams of GitHub open-source activity data (issues, pull requests, and stars) that reflect real-world distribution shifts and external perturbations. We construct a highly non-stationary, real-time benchmark from 400 high-star repositories, offering a standardized protocol, a reproducible continuous evaluation framework, and a public online leaderboard. This paradigm shifts the evaluation focus from one-off accuracy toward long-term robustness and stability.

📝 Abstract
Recent advances in time-series forecasting increasingly rely on pre-trained foundation-style models. While these models often claim broad generalization, existing evaluation protocols provide limited evidence. Indeed, most current benchmarks use static train-test splits that can easily lead to contamination as foundation models can inadvertently train on test data or perform model selection using test scores, which can inflate performance. We introduce Impermanent, a live benchmark that evaluates forecasting models under open-world temporal change by scoring forecasts sequentially over time on continuously updated data streams, enabling the study of temporal robustness, distributional shift, and performance stability rather than one-off accuracy on a frozen test set. Impermanent is instantiated on GitHub open-source activity, providing a naturally live and highly non-stationary dataset shaped by releases, shifting contributor behavior, platform/tooling changes, and external events. We focus on the top 400 repositories by star count and construct time series from issues opened, pull requests opened, push events, and new stargazers, evaluated over a rolling window with daily updates, alongside standardized protocols and leaderboards for reproducible, ongoing comparison. By shifting evaluation from static accuracy to sustained performance, Impermanent takes a concrete step toward assessing when and whether foundation-level generalization in time-series forecasting can be meaningfully claimed. Code and a live dashboard are available at https://github.com/TimeCopilot/impermanent and https://impermanent.timecopilot.dev.
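The rolling evaluation described above can be sketched as a rolling-origin loop: each day the forecast origin advances, the model sees only the history up to that origin, and its forecast for the next window is scored against the values that subsequently arrive. The sketch below is a minimal illustration under assumptions, not the benchmark's implementation; `rolling_evaluation`, `history_len`, the MAE metric, and the naive last-value baseline are all illustrative choices.

```python
import statistics

def rolling_evaluation(series, model, history_len, horizon):
    """Score a forecaster with a daily-advancing rolling origin.

    series: one observed stream (e.g. daily new stargazers for a repo)
    model: callable (history, horizon) -> list of `horizon` forecasts
    Returns one MAE per evaluation day, so stability over time can be
    inspected rather than a single aggregate score on a frozen test set.
    """
    daily_scores = []
    # Advance the forecast origin one step (one day) at a time.
    for origin in range(history_len, len(series) - horizon + 1):
        history = series[:origin]                     # data available so far
        actual = series[origin:origin + horizon]      # arrives after forecasting
        forecast = model(history, horizon)
        mae = statistics.fmean(abs(f - a) for f, a in zip(forecast, actual))
        daily_scores.append(mae)
    return daily_scores

# Naive last-value baseline on a toy stream with an abrupt spike,
# mimicking the non-stationarity (e.g. a release event) the benchmark targets.
naive = lambda hist, h: [hist[-1]] * h
scores = rolling_evaluation([5, 6, 7, 8, 100, 9, 8], naive,
                            history_len=3, horizon=1)
# → [1.0, 92.0, 91.0, 1.0]
```

The per-day score list, rather than its mean, is the point: the spike days expose exactly the kind of external perturbation that a static split would average away.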
Problem

Research questions and friction points this paper is trying to address.

temporal generalization
time series forecasting
evaluation benchmark
distributional shift
non-stationarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal generalization
live benchmark
time series forecasting
non-stationary data
foundation models