AirQualityBench: A Realistic Evaluation Benchmark for Global Air Quality Forecasting

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Air quality forecasting models are typically evaluated on idealized, preprocessed data, which hides the real-world conditions of global monitoring networks: uneven spatial coverage, structured missingness, heterogeneous pollutant scales, and deployment cost. This work proposes AirQualityBench, a global, hourly, multi-pollutant forecasting benchmark (2021–2025, 3,720 stations, six major pollutants) that preserves provider-native missingness patterns, treats missingness as an intrinsic part of the prediction task instead of imputing it away, and reports errors on physical concentration scales. The benchmark evaluates representative spatio-temporal baseline models under one unified protocol, with emphasis on mask awareness, scalability, and physical interpretability. Experiments show that strong performance on clean datasets does not reliably transfer to these realistic settings, which supports AirQualityBench as a testbed for model robustness and practical utility.

📝 Abstract

Air-quality forecasting models are commonly evaluated on regional, preprocessed, and normalized datasets, where missing observations are removed or artificially completed. Such protocols simplify comparison but hide the conditions that dominate real monitoring networks: uneven global coverage, structured missingness, heterogeneous pollutant scales, and deployment cost. We introduce AirQualityBench, a global multi-pollutant benchmark designed to evaluate forecasting models under these realistic conditions. The benchmark contains hourly observations from 3,720 monitoring stations over 2021–2025, covers six major pollutants, and preserves provider-native observation masks. Rather than imputing a dense data tensor, AirQualityBench exposes missingness as part of the forecasting problem and reports errors on valid future observations after inverse transformation to physical concentration scales. Evaluating representative spatio-temporal models under this unified protocol shows that strong performance on sanitized datasets does not reliably transfer to global, fragmented monitoring streams. AirQualityBench therefore serves as a realistic testbed for scalable, mask-aware, and physically interpretable air-quality forecasting. All benchmark data, code, evaluation scripts, and baseline implementations are available on GitHub at https://github.com/Star-Learning/AirQualityBench.
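The evaluation rule described in the abstract (score only valid future observations, after mapping predictions back to physical concentrations) can be illustrated with a minimal sketch. This is not the benchmark's actual code: the function name `masked_mae_physical`, the choice of mean absolute error as the metric, and the per-pollutant z-score normalization are all assumptions made for illustration, since the abstract does not specify the transform or the metric.

```python
import math


def masked_mae_physical(pred_norm, target_norm, mask, mean, std):
    """Per-pollutant MAE in physical units over valid observations only.

    pred_norm, target_norm: list of rows, one normalized value per pollutant.
    mask: same shape; True where the future observation actually exists.
    mean, std: per-pollutant statistics of an ASSUMED z-score normalization.
    """
    n_pollutants = len(mean)
    abs_err_sum = [0.0] * n_pollutants
    valid_count = [0] * n_pollutants
    for pred_row, target_row, mask_row in zip(pred_norm, target_norm, mask):
        for p in range(n_pollutants):
            if not mask_row[p]:
                continue  # missing observation: skipped, never imputed
            # Invert the (assumed) z-score to get physical concentrations.
            pred_phys = pred_row[p] * std[p] + mean[p]
            target_phys = target_row[p] * std[p] + mean[p]
            abs_err_sum[p] += abs(pred_phys - target_phys)
            valid_count[p] += 1
    # A pollutant with no valid observations has an undefined score (NaN).
    return [s / c if c else math.nan for s, c in zip(abs_err_sum, valid_count)]


# Toy example: 3 time steps, 2 pollutants (all values are made up).
pred = [[0.5, 1.0], [0.0, -1.0], [1.0, 0.0]]
target = [[0.0, 0.0], [0.0, math.nan], [2.0, 0.0]]
mask = [[True, True], [True, False], [True, False]]
mean = [20.0, 40.0]  # hypothetical per-pollutant means
std = [10.0, 4.0]    # hypothetical per-pollutant standard deviations

scores = masked_mae_physical(pred, target, mask, mean, std)
print(scores)  # [5.0, 4.0]
```

Because the error is computed after the inverse transform, the two scores are in each pollutant's own concentration units and cannot be shifted by a different normalization choice. Masked entries contribute nothing, so a model is neither rewarded nor penalized for what it predicts where no observation exists.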

Problem

Research questions and friction points this paper is trying to address.

air quality forecasting

realistic evaluation

structured missingness

global monitoring networks

heterogeneous pollutant scales

Innovation

Methods, ideas, or system contributions that make the work stand out.

realistic benchmark

structured missingness

mask-aware forecasting