RealBench: Benchmarking Data-Driven Numerical Weather Forecasting Under Operational Conditions and Extreme Event Challenges

📅 2026-05-24

🤖 AI Summary

This study addresses a limitation of current evaluations of weather forecasting models: they rely heavily on reanalysis data such as ERA5, which does not reflect realistic operational constraints, and so they often overestimate performance, particularly during extreme events. To overcome this, the authors propose RealBench, a benchmark that evaluates models under operational conditions instead of against reanalysis products. RealBench combines an out-of-distribution test set spanning 2025, low-latency operational analysis data, and in-situ observations from over 10,000 global stations. The framework introduces event-specific metrics for high-impact events such as heatwaves, cold surges, and tropical cyclones, giving an operationally oriented evaluation platform designed to avoid data leakage. The results show substantial discrepancies between reanalysis-based metrics and real-world performance, which supports RealBench as a more reliable and operationally relevant basis for evaluating next-generation AI weather forecasting systems.
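To make the idea of an event-specific metric concrete, here is a minimal sketch of a threshold-exceedance score, the critical success index (CSI). It is an illustration only: the summary does not say which metrics RealBench uses, and the function name, the 35 °C threshold, and the sample values are all assumptions.

```python
def critical_success_index(forecast, observed, threshold):
    """CSI = hits / (hits + misses + false alarms) for exceedance events.

    An 'event' here is any value at or above `threshold` (a hypothetical
    heatwave definition, not one taken from RealBench).
    """
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        f_event = f >= threshold
        o_event = o >= threshold
        if f_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif f_event:
            false_alarms += 1
    denom = hits + misses + false_alarms
    # With no forecast or observed events the score is undefined.
    return hits / denom if denom > 0 else float("nan")


# Made-up daily maximum temperatures in degrees Celsius.
obs = [31.0, 36.5, 38.0, 29.0, 35.5]
fc = [30.0, 35.2, 34.0, 35.1, 36.0]
csi = critical_success_index(fc, obs, threshold=35.0)
print(csi)
```

Unlike a global RMSE, a score of this kind ignores the many non-event days, so a model that smooths out extremes is penalized directly through its misses.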

📝 Abstract

Accurate evaluation of weather forecasting models is critical for their reliable deployment in real-world applications. However, existing benchmarks predominantly rely on reanalysis products such as ERA5, which are generated through delayed data assimilation and do not reflect the constraints of real-time operational forecasting. The result is a systematic mismatch between benchmark performance and real-world forecasting performance. In this work, we introduce RealBench, a next-generation benchmark for AI weather forecasting that emphasizes realistic evaluation under operational conditions. RealBench features a strictly out-of-distribution test set spanning 2025 to eliminate data leakage and capture recent atmospheric regimes. It integrates multiple data sources, including low-latency operational analysis and a large-scale global in-situ observation dataset comprising over 10,000 stations, enabling direct evaluation against real atmospheric measurements. Beyond standard global metrics, RealBench provides a comprehensive evaluation framework for high-impact extreme events, including heatwaves, cold surges, and tropical cyclones, using event-specific metrics that better reflect real-world forecasting priorities. The evaluation results reveal substantial discrepancies between reanalysis-based metrics and real-world performance, particularly for extreme events. By highlighting the limitations of existing benchmarks, this work establishes a more faithful and operationally relevant evaluation paradigm, providing a rigorous foundation for advancing next-generation AI weather forecasting systems. The benchmark implementation is available at: https://github.com/lixruize-del/NWP-Benchmark.
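Evaluating a gridded forecast directly against in-situ measurements, as the abstract describes, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the RealBench implementation: it uses nearest-grid-point sampling on a regular latitude/longitude grid, ignores longitude wrap-around and station elevation, and every name and value in it is hypothetical.

```python
import math


def nearest_index(coords, target):
    """Index of the grid coordinate closest to `target`."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - target))


def sample_at_stations(grid, lats, lons, stations):
    """Sample grid[i][j] (valid at lats[i], lons[j]) at each (lat, lon) station.

    Nearest-neighbour lookup only; longitude wrap-around at the date line
    is not handled in this sketch.
    """
    return [
        grid[nearest_index(lats, lat)][nearest_index(lons, lon)]
        for lat, lon in stations
    ]


def rmse(pred, obs):
    """Root-mean-square error between paired predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


# Toy 2x2 forecast field of 2 m temperature and two made-up stations.
lats = [0.0, 1.0]
lons = [0.0, 1.0]
grid = [[10.0, 12.0],
        [14.0, 16.0]]
stations = [(0.1, 0.2), (0.9, 0.8)]
station_obs = [11.0, 14.0]

samples = sample_at_stations(grid, lats, lons, stations)
station_rmse = rmse(samples, station_obs)
print(samples, station_rmse)
```

Scoring against station values in this way means the reference is a measurement and not the output of a data assimilation system, which is the distinction the abstract draws between reanalysis-based and observation-based evaluation.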

Problem

Research questions and friction points this paper is trying to address.

weather forecasting

benchmarking

operational conditions

extreme events

reanalysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

RealBench

operational weather forecasting

extreme event evaluation