🤖 AI Summary
Current machine learning solar flare forecasting suffers from systematic deficiencies and inconsistencies arising from coarse preprocessing of heterogeneous data sources, particularly operational (SWPC) and science-quality datasets. Method: This study conducts the first systematic assessment of data quality impacts via data provenance analysis, cross-source consistency verification, and predictive error attribution modeling. Contribution/Results: Quantitative evaluation reveals significant degradation in forecast skill scores (e.g., TS, POD, FAR) caused by inter-product discrepancies in temporal resolution, event labeling criteria, and instrumental response, with false-alarm and miss rates increasing by 12–28%. We propose a robustness-oriented data correction framework and a standardized preprocessing protocol, together with an operational data selection guideline. These contributions provide a methodological foundation and a practical pathway for improving the reliability, reproducibility, and operational readiness of AI-driven space weather forecasting.
📝 Abstract
Machine learning models for forecasting solar flares have been trained and tested using a variety of data sources, such as Space Weather Prediction Center (SWPC) operational data and science-quality data. Typically, data from these sources are minimally processed before being used to train and validate a forecasting model. However, predictive performance can be impaired if defects in, and inconsistencies between, these data sources are ignored. For a number of commonly used data sources, together with the software that queries and outputs processed data, we identify their respective defects and inconsistencies, quantify their extent, and show how they can affect the predictions produced by data-driven machine learning forecasting models. We also outline procedures for fixing these issues, or at least mitigating their impact. Finally, based on our thorough comparisons of how these data sources affect the predictive skill scores of the trained forecasting models, we offer recommendations for the use of different data products in operational forecasting.
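The skill scores referenced above (TS, POD, FAR) are standard categorical verification metrics computed from a 2×2 contingency table of forecasts versus observed flares. A minimal sketch of their definitions, with purely illustrative counts (not results from this study):

```python
def skill_scores(hits, misses, false_alarms):
    """Categorical verification metrics for binary flare forecasts.

    hits         -- flare forecast AND flare observed (true positives)
    misses       -- no forecast but flare observed (false negatives)
    false_alarms -- flare forecast but none observed (false positives)
    """
    pod = hits / (hits + misses)                # Probability of Detection
    far = false_alarms / (hits + false_alarms)  # False Alarm Ratio
    ts = hits / (hits + misses + false_alarms)  # Threat Score (CSI)
    return pod, far, ts

# Illustrative example: 40 hits, 10 misses, 20 false alarms.
pod, far, ts = skill_scores(hits=40, misses=10, false_alarms=20)
print(round(pod, 3), round(far, 3), round(ts, 3))  # → 0.8 0.333 0.571
```

Because all three scores are ratios of event counts, mislabeled or duplicated events in a source catalog shift them directly, which is how inter-product labeling discrepancies propagate into the reported skill of a trained model.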