Data Quality Issues in Flare Prediction using Machine Learning Models

📅 2025-12-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Current machine learning–based solar flare forecasting suffers from systematic deficiencies and inconsistencies arising from coarse preprocessing of heterogeneous data sources, particularly operational (SWPC) and science-quality datasets. Method: This study conducts the first systematic assessment of data quality impacts via data provenance analysis, cross-source consistency verification, and predictive error attribution modeling. Contribution/Results: Quantitative evaluation reveals significant degradation in forecast skill scores (e.g., TS, POD, FAR) due to inter-product discrepancies in temporal resolution, event labeling criteria, and instrumental response, with false-alarm and miss rates increasing by up to 12–28%. We propose a robustness-oriented data correction framework and a standardized preprocessing protocol, accompanied by an operational data selection guideline. These contributions provide a methodological foundation and a practical pathway for enhancing the reliability, reproducibility, and operational readiness of AI-driven space weather forecasting.
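As a point of reference for the scores named above, the following minimal sketch computes them from a binary contingency table. It assumes that TS denotes the threat score (critical success index), POD the probability of detection, and FAR the false alarm ratio; the source does not spell these out, and the helper names and sample labels here are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    tp: int  # flare forecast, flare observed (hit)
    fp: int  # flare forecast, no flare observed (false alarm)
    fn: int  # no flare forecast, flare observed (miss)
    tn: int  # no flare forecast, no flare observed


def contingency(y_true, y_pred):
    """Count hits, false alarms, misses, and correct negatives."""
    tp = fp = fn = tn = 0
    for observed, forecast in zip(y_true, y_pred):
        if observed and forecast:
            tp += 1
        elif forecast:
            fp += 1
        elif observed:
            fn += 1
        else:
            tn += 1
    return Contingency(tp, fp, fn, tn)


def _ratio(numerator, denominator):
    # Return NaN instead of raising when a score is undefined.
    return numerator / denominator if denominator else float("nan")


def pod(c):
    """Probability of detection (assumed definition): hits / observed events."""
    return _ratio(c.tp, c.tp + c.fn)


def far(c):
    """False alarm ratio (assumed definition): false alarms / positive forecasts."""
    return _ratio(c.fp, c.tp + c.fp)


def threat_score(c):
    """Threat score (assumed meaning of TS): hits / (hits + misses + false alarms)."""
    return _ratio(c.tp, c.tp + c.fp + c.fn)


# Made-up labels for illustration only.
observed = [1, 1, 1, 0, 0, 0, 0, 1]
forecast = [1, 1, 0, 1, 0, 0, 0, 0]
table = contingency(observed, forecast)
print(table, pod(table), far(table), threat_score(table))
```

Because each score depends directly on the event labels, relabeling even a few events between data products changes the table counts and therefore every score derived from them.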

📝 Abstract

Machine learning models for forecasting solar flares have been trained and tested using a variety of data sources, such as Space Weather Prediction Center (SWPC) operational and science-quality data. Typically, data from these sources is minimally processed before being used to train and validate a forecasting model. However, predictive performance can be impaired if defects in and inconsistencies between these data sources are ignored. For a number of commonly used data sources, together with the software tools that query them and output processed data, we identify their respective defects and inconsistencies, quantify their extent, and show how they can affect the predictions produced by data-driven machine learning forecasting models. We also outline procedures for fixing these issues or at least mitigating their impacts. Finally, based on our thorough comparisons of the impacts of data sources on the trained forecasting model in terms of predictive skill scores, we offer recommendations for the use of different data products in operational forecasting.
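One simple way to quantify inconsistencies between two data sources is to compare their event labels over the time bins they share. The sketch below is only an illustration of that idea, not the paper's procedure: the function name, the dictionary layout (ISO date string mapped to a flare/no-flare flag), and the sample values are all hypothetical.

```python
def compare_labels(operational, science):
    """Compare flare labels from two sources keyed by time bin.

    Both arguments map a time-bin key (here an ISO date string) to a
    boolean flare label. This layout is an assumption for illustration.
    """
    shared = sorted(operational.keys() & science.keys())
    mismatched = [key for key in shared if operational[key] != science[key]]
    return {
        # Bins present in both sources whose labels disagree.
        "mismatched": mismatched,
        # Fraction of shared bins with conflicting labels.
        "mismatch_rate": len(mismatched) / len(shared) if shared else float("nan"),
        # Bins covered by only one source (coverage gaps).
        "only_operational": sorted(operational.keys() - science.keys()),
        "only_science": sorted(science.keys() - operational.keys()),
    }


# Made-up labels for illustration only.
operational_labels = {
    "2014-01-01": True,
    "2014-01-02": False,
    "2014-01-03": True,
    "2014-01-04": False,
}
science_labels = {
    "2014-01-01": True,
    "2014-01-02": True,
    "2014-01-03": True,
    "2014-01-05": False,
}
report = compare_labels(operational_labels, science_labels)
print(report)
```

Reporting coverage gaps separately from label conflicts keeps the two kinds of defect distinct, since a missing bin and a contradictory label call for different mitigation.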

Problem

Research questions and friction points this paper is trying to address.

Identifies defects in solar flare prediction data sources

Quantifies how data inconsistencies impair machine learning forecasts

Recommends procedures to fix or mitigate data quality issues

Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies defects in solar flare data sources

Quantifies impacts on machine learning predictions

Recommends procedures to mitigate data quality issues
