🤖 AI Summary
This study addresses the overreliance on strongly periodic benchmark datasets in current time series forecasting research, which obscures the effectiveness of classical methods and inflates the apparent progress of deep learning models. The authors propose an evaluation paradigm that categorizes time series by their intrinsic characteristics and systematically compares task-adapted classical baselines (such as ARIMA and ETS) against modern deep learning approaches. To better reflect real-world challenges, they also introduce datasets exhibiting greater complexity and non-stationarity. Their analysis shows that, on most standard benchmarks, complex deep learning models offer no significant performance advantage over classical methods, exposing a critical bias in the prevailing evaluation framework. The work calls for more scientifically rigorous, equitable, and reproducible evaluation standards in time series forecasting research.
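To make the comparison concrete, below is a minimal sketch (not the authors' code) of the kind of task-adapted classical baselines the summary refers to, fit with statsmodels' ARIMA and Holt-Winters exponential smoothing implementations. The synthetic series, the hand-picked ARIMA order, and the seasonal period of 24 are all illustrative assumptions.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
t = np.arange(24 * 10)  # ten "days" of hourly observations
# Synthetic stand-in for a strongly periodic benchmark series.
y = 10 + 2 * np.sin(2 * np.pi * t / 24) + rng.normal(0.0, 0.3, t.size)
train, test = y[:-24], y[-24:]

# ARIMA baseline; the (p, d, q) order is hand-picked here, whereas a
# task-adapted pipeline would select it (e.g., by AIC search).
arima_fc = ARIMA(train, order=(2, 0, 1)).fit().forecast(steps=24)

# ETS baseline: additive Holt-Winters with a 24-step seasonal cycle.
ets_fit = ExponentialSmoothing(
    train, trend="add", seasonal="add", seasonal_periods=24
).fit()
ets_fc = ets_fit.forecast(24)

for name, fc in (("ARIMA", arima_fc), ("ETS", ets_fc)):
    print(f"{name} MAE: {np.mean(np.abs(fc - test)):.3f}")
```

On a series this periodic, both baselines typically achieve low error, which is precisely the regime in which the study finds little headroom for deep models.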
📝 Abstract
We argue that the current practice of evaluating AI/ML time-series forecasting models, predominantly on benchmarks characterized by strong, persistent periodicities and seasonalities, obscures real progress by overlooking the performance of efficient classical methods. We demonstrate that these "standard" datasets often exhibit dominant autocorrelation patterns and seasonal cycles that simpler linear or statistical models can capture effectively. On such data, complex deep learning architectures are frequently no more performant than their classical counterparts, raising the question of whether any marginal improvements justify the significant increase in computational overhead and model complexity. We call on the community to (I) retire or substantially augment current benchmarks with datasets exhibiting a wider spectrum of non-stationarities, such as structural breaks, time-varying volatility, and concept drift, and less predictable dynamics drawn from diverse real-world domains, and (II) require every deep learning submission to include robust classical and simple baselines, chosen appropriately for the characteristics of each downstream task's time series. Doing so will help ensure that reported gains reflect genuine methodological advances rather than artifacts of benchmark selection favoring models adept at learning repetitive patterns.
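One way to operationalize the abstract's diagnostic is the hypothetical sketch below: it estimates seasonal strength from an STL decomposition, using the standard measure max(0, 1 - Var(remainder) / Var(seasonal + remainder)) popularized by Hyndman and Athanasopoulos, and then scores a seasonal-naive baseline. The synthetic series and its period are assumptions for illustration, not data from the paper.

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(1)
t = np.arange(24 * 30)  # thirty "days" of hourly observations
y = 5 * np.sin(2 * np.pi * t / 24) + rng.normal(0.0, 0.5, t.size)

# Seasonal strength F_s = max(0, 1 - Var(R) / Var(S + R)), where S and R
# are the STL seasonal and remainder components; values near 1 indicate
# the dominant periodicity the abstract describes.
res = STL(y, period=24).fit()
strength = max(0.0, 1.0 - np.var(res.resid) / np.var(res.seasonal + res.resid))
print(f"seasonal strength: {strength:.2f}")

# Seasonal-naive baseline: repeat the last observed seasonal cycle.
train, test = y[:-24], y[-24:]
naive_fc = train[-24:]
print(f"seasonal-naive MAE: {np.mean(np.abs(naive_fc - test)):.3f}")
```

When the strength statistic is close to one, the seasonal-naive error sets a demanding floor, which is exactly why the abstract insists that such baselines accompany every deep learning submission.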