Position: There are no Champions in Long-Term Time Series Forecasting

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-term time series forecasting suffers from inconsistent benchmarks and non-rigorous evaluation, undermining the robustness of claimed state-of-the-art (SOTA) advances. Method: Through 3,500+ reproducible training runs, cross-benchmark analysis across 14 datasets, sensitivity studies, and statistical significance testing, we systematically demonstrate that purported performance gains of recent complex models are largely artifacts of flawed evaluation protocols—minor variations in experimental setup or metrics readily invert model rankings. Contribution/Results: We expose the fragility of prevailing SOTA claims and propose a trustworthy evaluation paradigm grounded in three pillars: (i) mandatory standardized evaluation protocols, (ii) publicly released, fully reproducible hyperparameter configurations, and (iii) mandatory statistical hypothesis testing. Our findings challenge the community’s emphasis on architectural complexity and advocate a shift toward rigorous validation of methodological robustness and generalizability.
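As a toy illustration of the statistical hypothesis testing the summary calls for, the sketch below runs a paired Wilcoxon signed-rank test over per-dataset MSEs for two forecasters. The model names, error values, and significance threshold are invented for illustration; this is not the paper's released protocol or data.

```python
# Minimal sketch: test whether a claimed SOTA gap survives a paired
# significance test across datasets. All numbers are hypothetical.
from scipy.stats import wilcoxon

# Hypothetical mean test MSEs of two forecasters on 14 benchmark datasets.
mse_model_a = [0.384, 0.412, 0.291, 0.334, 0.452, 0.517, 0.274,
               0.308, 0.365, 0.401, 0.483, 0.252, 0.303, 0.341]
mse_model_b = [0.379, 0.420, 0.289, 0.341, 0.446, 0.526, 0.271,
               0.318, 0.361, 0.413, 0.470, 0.266, 0.302, 0.357]

# Two-sided Wilcoxon signed-rank test, paired by dataset: small,
# dataset-by-dataset gaps often fail to reach significance.
stat, p_value = wilcoxon(mse_model_a, mse_model_b)
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.3f}")
if p_value >= 0.05:
    print("Difference not significant at alpha = 0.05; no champion here.")
```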

📝 Abstract
Recent advances in long-term time series forecasting have introduced numerous complex prediction models that consistently outperform previously published architectures. However, this rapid progression raises concerns regarding inconsistent benchmarking and reporting practices, which may undermine the reliability of these comparisons. Our position emphasizes the need to shift focus away from pursuing ever-more complex models and towards enhancing benchmarking practices through rigorous and standardized evaluation methods. To support our claim, we first perform a broad, thorough, and reproducible evaluation of the top-performing models on the most popular benchmark by training 3,500+ networks over 14 datasets. Then, through a comprehensive analysis, we find that slight changes to experimental setups or current evaluation metrics drastically shift the common belief that newly published results are advancing the state of the art. Our findings suggest the need for rigorous and standardized evaluation methods that enable more substantiated claims, including reproducible hyperparameter setups and statistical testing.
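To make the claim that slight evaluation changes can shift conclusions concrete, here is a minimal sketch that ranks three hypothetical models under MSE and under MAE and measures how much the two orderings disagree. The model names and error values are invented for illustration, not results from the paper.

```python
# Minimal sketch: the same results table can yield opposite rankings
# depending on which error metric is reported. Numbers are hypothetical.
from scipy.stats import kendalltau

results = {
    "ModelA": {"mse": 0.310, "mae": 0.372},
    "ModelB": {"mse": 0.305, "mae": 0.381},
    "ModelC": {"mse": 0.318, "mae": 0.365},
}

def ranking(metric):
    """Order model names from best (lowest error) to worst."""
    return sorted(results, key=lambda m: results[m][metric])

rank_mse, rank_mae = ranking("mse"), ranking("mae")
print("Ranking by MSE:", rank_mse)   # ['ModelB', 'ModelA', 'ModelC']
print("Ranking by MAE:", rank_mae)   # ['ModelC', 'ModelA', 'ModelB']

# Rank correlation between the two orderings; low or negative values
# mean the "state of the art" depends on the chosen metric.
tau, _ = kendalltau([rank_mse.index(m) for m in results],
                    [rank_mae.index(m) for m in results])
print(f"Kendall tau between the two rankings: {tau:.2f}")
```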
Problem

Research questions and friction points this paper is trying to address.

Inconsistent benchmarking in time series forecasting
Need for standardized evaluation methods
Impact of experimental setups on model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized evaluation methods
Reproducible hyperparameter setups (see the configuration sketch after this list)
Rigorous statistical testing
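As a rough illustration of what a releasable, reproducible hyperparameter setup could look like, the sketch below defines a fully specified, seeded experiment configuration and writes it to JSON. The field names, values, and file name are hypothetical assumptions, not the configurations released with the paper.

```python
# Minimal sketch of a fully specified, seeded, releasable experiment config.
# All fields and defaults below are hypothetical examples.
import json
import random
from dataclasses import dataclass, asdict

import numpy as np

@dataclass(frozen=True)
class ExperimentConfig:
    dataset: str = "ETTh1"      # benchmark dataset identifier (example)
    lookback: int = 336         # input window length
    horizon: int = 96           # forecast horizon
    learning_rate: float = 1e-4
    batch_size: int = 32
    epochs: int = 10
    seed: int = 2025            # every run fixes and reports its seed

def set_seed(seed: int) -> None:
    """Seed every RNG the training code touches."""
    random.seed(seed)
    np.random.seed(seed)

config = ExperimentConfig()
set_seed(config.seed)

# Release the exact configuration alongside the reported results.
with open("experiment_config.json", "w") as f:
    json.dump(asdict(config), f, indent=2)
```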
Lorenzo Brigato
Postdoctoral Researcher @ University of Bern
Machine Learning · Deep Learning · Robotics · Artificial Intelligence
Rafael Morand
ARTORG Center, University of Bern, Graduate School for Cellular and Biomedical Sciences, University of Bern
Knut Strommen
ARTORG Center, University of Bern, Graduate School for Cellular and Biomedical Sciences, University of Bern
Maria Panagiotou
University of Bern
Machine Learning
Markus Schmidt
Center for Experimental Neurology, Department of Neurology, Bern University Hospital
Stavroula-Georgia Mougiakakou
University of Bern
Artificial Intelligence · Machine Learning · Computer Vision · Biomedical Engineering