🤖 AI Summary
Traditional time-series forecasting evaluation relies on global metrics (e.g., SMAPE), whose hierarchical averaging obscures model performance variations across distinct conditions. To address this, we propose the first aspect-based, fine-grained evaluation framework tailored to forecasting tasks, systematically diagnosing model behavior along critical dimensions, including stationarity, anomaly presence, and forecast horizon. Implemented in Python, the framework enables cross-dimensional performance decomposition and visualization for 24 classical and deep learning models. Experiments across multiple benchmark datasets uncover nontrivial insights: for example, NHITS excels only in multi-step forecasting, while ETS and Theta demonstrate superior robustness under anomalies. Crucially, model ranking is highly dimension-dependent, highlighting the inadequacy of single-metric evaluation. The open-source toolkit facilitates reproducible, scenario-aware model selection and improvement.
📝 Abstract
Accurate evaluation of forecasting models is essential for ensuring reliable predictions. Current practices for evaluating and comparing forecasting models focus on summarising performance into a single score, using metrics such as SMAPE. While convenient, averaging performance over all samples dilutes relevant information about model behavior under varying conditions. This limitation is especially problematic for time series forecasting, where multiple layers of averaging (across time steps, horizons, and multiple time series in a dataset) can mask relevant performance variations. We address this limitation by proposing ModelRadar, a framework for evaluating univariate time series forecasting models across multiple aspects, such as stationarity, the presence of anomalies, or the forecasting horizon. We demonstrate the advantages of this framework by comparing 24 forecasting methods, including classical approaches and different machine learning algorithms. NHITS, a state-of-the-art neural network architecture, performs best overall, but its superiority varies with forecasting conditions. For instance, concerning the forecasting horizon, we found that NHITS (along with other neural networks) outperforms classical approaches only for multi-step-ahead forecasting. Another relevant insight is that classical approaches such as ETS or Theta are notably more robust in the presence of anomalies. These and other findings highlight the importance of aspect-based model evaluation for both practitioners and researchers. ModelRadar is available as a Python package.
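The core idea of aspect-based evaluation can be sketched in a few lines: instead of collapsing all forecast errors into one global score, the same per-sample errors are grouped by a condition of interest (here, the forecasting horizon step). This is a minimal illustration with toy data and a generic pandas group-by, not the actual ModelRadar API; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

def smape(y_true, y_pred):
    """Symmetric MAPE (in percent) for each forecast point."""
    return 200.0 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))

# Hypothetical long-format results: one row per (series, horizon step).
results = pd.DataFrame({
    "series_id": ["A", "A", "A", "B", "B", "B"],
    "horizon":   [1, 2, 3, 1, 2, 3],
    "y_true":    [10.0, 12.0, 14.0, 5.0, 6.0, 7.0],
    "y_pred":    [11.0, 15.0, 13.0, 5.0, 9.0, 6.5],
})
results["smape"] = smape(results["y_true"], results["y_pred"])

# Conventional evaluation: a single score averaged over everything.
global_score = results["smape"].mean()

# Aspect-based view: the same errors, decomposed by forecasting horizon,
# revealing at which steps a model is actually weak or strong.
per_horizon = results.groupby("horizon")["smape"].mean()
print(global_score)
print(per_horizon)
```

The same group-by pattern extends to any aspect that can be attached as a column, such as a per-series stationarity flag or an anomaly indicator, and to comparing several models side by side.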