🤖 AI Summary
This study addresses a limitation of current weather forecasting model evaluations: they rely heavily on reanalysis data such as ERA5, which is produced through delayed data assimilation and does not reflect real-time operational constraints, so benchmark scores can diverge from real-world forecast skill, particularly during extreme events. To address this, the authors propose RealBench, a benchmark designed for evaluation under operational conditions. RealBench combines a strictly out-of-distribution test set spanning 2025, low-latency operational analysis data, and in-situ observations from over 10,000 global stations. In addition to standard global metrics, the framework provides event-specific metrics for high-impact events such as heatwaves, cold surges, and tropical cyclones. The evaluation results reveal substantial discrepancies between reanalysis-based metrics and real-world performance, particularly for extreme events, positioning RealBench as a more faithful and operationally relevant foundation for evaluating next-generation AI weather forecasting systems.
📝 Abstract
Accurate evaluation of weather forecasting models is critical for their reliable deployment in real-world applications. However, existing benchmarks predominantly rely on reanalysis products such as ERA5, which are generated through delayed data assimilation and do not reflect the constraints of real-time operational forecasting. The result is a systematic mismatch between benchmark performance and real-world forecasting performance. In this work, we introduce RealBench, a next-generation benchmark for AI weather forecasting that emphasizes realistic evaluation under operational conditions. RealBench features a strictly out-of-distribution test set spanning 2025 to eliminate data leakage and capture recent atmospheric regimes. It integrates multiple data sources, including low-latency operational analysis and a large-scale global in-situ observation dataset comprising over 10,000 stations, enabling direct evaluation against real atmospheric measurements. Beyond standard global metrics, RealBench provides a comprehensive evaluation framework for high-impact extreme events, including heatwaves, cold surges, and tropical cyclones, using event-specific metrics that better reflect real-world forecasting priorities. The evaluation results reveal substantial discrepancies between reanalysis-based metrics and real-world performance, particularly for extreme events. By highlighting the limitations of existing benchmarks, this work establishes a more faithful and operationally relevant evaluation paradigm, providing a rigorous foundation for advancing next-generation AI weather forecasting systems. The benchmark implementation is available at: https://github.com/lixruize-del/NWP-Benchmark.
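To make the idea of evaluating against in-situ observations concrete, here is a minimal sketch of scoring a gridded forecast at station locations. It is not taken from the RealBench repository: the function names, the nearest-grid-point sampling, and the NaN convention for missing station reports are all assumptions chosen for illustration, and the benchmark itself may interpolate, quality-control, or aggregate differently.

```python
import numpy as np


def nearest_grid_values(field, grid_lats, grid_lons, st_lats, st_lons):
    """Sample a 2-D (lat, lon) field at the grid point nearest each station.

    Hypothetical helper: nearest-neighbor sampling is an assumption here,
    not necessarily what the benchmark does.
    """
    grid_lats = np.asarray(grid_lats, dtype=float)
    grid_lons = np.asarray(grid_lons, dtype=float)
    st_lats = np.asarray(st_lats, dtype=float)
    st_lons = np.asarray(st_lons, dtype=float)

    # Index of the closest grid latitude for every station.
    lat_idx = np.abs(grid_lats[:, None] - st_lats[None, :]).argmin(axis=0)

    # Wrap longitude differences into [-180, 180) so that a station at
    # 359.6E is matched to the grid point at 0E rather than 359E.
    dlon = (grid_lons[:, None] - st_lons[None, :] + 180.0) % 360.0 - 180.0
    lon_idx = np.abs(dlon).argmin(axis=0)

    return np.asarray(field)[lat_idx, lon_idx]


def station_rmse(pred, obs):
    """RMSE over stations, skipping stations whose observation is NaN."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    valid = ~np.isnan(obs)
    return float(np.sqrt(np.mean((pred[valid] - obs[valid]) ** 2)))
```

Computing the same score once against station observations and once against a reanalysis field sampled at the same points is one simple way to expose the kind of discrepancy the abstract describes.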