🤖 AI Summary
Long-term time series forecasting suffers from inconsistent benchmarks and non-rigorous evaluation, undermining the robustness of claimed state-of-the-art (SOTA) advances. Method: Through 3,500+ reproducible training runs, cross-benchmark analysis across 14 datasets, sensitivity studies, and statistical significance testing, we systematically demonstrate that the purported performance gains of recent complex models are largely artifacts of flawed evaluation protocols: minor variations in experimental setup or metrics readily invert model rankings. Contribution/Results: We expose the fragility of prevailing SOTA claims and propose a trustworthy evaluation paradigm grounded in three pillars: (i) standardized, mandatory evaluation protocols; (ii) publicly released, fully reproducible hyperparameter configurations; and (iii) mandatory statistical hypothesis testing. Our findings challenge the community's emphasis on architectural complexity and advocate a shift toward rigorous validation of methodological robustness and generalizability.
📝 Abstract
Recent advances in long-term time series forecasting have introduced numerous complex prediction models that consistently outperform previously published architectures. However, this rapid progression raises concerns about inconsistent benchmarking and reporting practices, which may undermine the reliability of these comparisons. Our position is that the community should shift its focus away from pursuing ever-more complex models and toward improving benchmarking practices through rigorous and standardized evaluation methods. To support our claim, we first perform a broad, thorough, and reproducible evaluation of the top-performing models on the most popular benchmark by training 3,500+ networks over 14 datasets. Then, through a comprehensive analysis, we find that slight changes to the experimental setup or to current evaluation metrics overturn the common belief that newly published results advance the state of the art. Our findings underscore the need for rigorous and standardized evaluation methods that enable more substantiated claims, including reproducible hyperparameter setups and statistical testing.
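The statistical testing the abstract calls for can be illustrated with a minimal sketch: a paired sign test over per-window forecast errors of two models evaluated on the same test windows. All model names and error values below are hypothetical stand-ins, not data from the paper, and the sign test is a simple stdlib-only proxy for tests such as the Wilcoxon signed-rank test.

```python
import random
from math import comb
from statistics import mean

random.seed(0)

# Hypothetical per-window squared errors for two models, "A" (baseline) and
# "B" (a new model with a tiny, noisy apparent "gain"); purely illustrative.
errors_a = [0.40 + random.gauss(0, 0.05) for _ in range(50)]
errors_b = [e + random.gauss(0.01, 0.05) for e in errors_a]

# Paired sign test: under H0 (no real difference), each model "wins" a
# given test window with probability 1/2.
n = len(errors_a)
wins_a = sum(a < b for a, b in zip(errors_a, errors_b))

# Two-sided exact binomial p-value: 2 * P(X >= max(wins, n - wins)).
k = max(wins_a, n - wins_a)
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n)

print(f"mean MSE A={mean(errors_a):.3f}, B={mean(errors_b):.3f}")
print(f"A wins {wins_a}/{n} windows, two-sided sign-test p={p_value:.3f}")
```

A small mean-error gap with a large p-value is exactly the situation the abstract warns about: a ranking that looks like progress but is not statistically distinguishable from noise.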