Data Quality Issues in Flare Prediction using Machine Learning Models

📅 2025-12-15
🤖 AI Summary
Problem: Current machine learning-based solar flare forecasting suffers from systematic deficiencies and inconsistencies that arise when heterogeneous data sources, particularly Space Weather Prediction Center (SWPC) operational data and science-quality datasets, are only coarsely preprocessed. Method: The study conducts a systematic assessment of data quality impacts via data provenance analysis, cross-source consistency verification, and predictive error attribution modeling. Contribution/Results: Quantitative evaluation shows that inter-product discrepancies in temporal resolution, event labeling criteria, and instrumental response significantly degrade forecast skill scores (e.g., TS, POD, FAR), with false-alarm and miss rates reported to increase by as much as 12–28%. The authors propose a robustness-oriented data correction framework and a standardized preprocessing protocol, accompanied by an operational data selection guideline. These contributions provide a methodological foundation and a practical pathway for improving the reliability, reproducibility, and operational readiness of AI-driven space weather forecasting.
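The skill scores named in the summary are all functions of a binary forecast's confusion matrix. The sketch below is illustrative only and is not taken from the paper: it assumes POD is the probability of detection, FAR is the false alarm ratio (FP / (TP + FP), not the false alarm rate), and TS is the threat score, also called the critical success index. The true skill statistic (TSS) is added because it is common in flare forecasting, although the summary does not mention it. The function name and the example counts are hypothetical.

```python
import math


def skill_scores(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary-forecast skill scores from confusion-matrix counts.

    tp: flares forecast and observed      fp: flares forecast but not observed
    fn: flares observed but not forecast  tn: quiet periods correctly forecast
    Definitions here are assumptions about what the summary's acronyms mean.
    """

    def ratio(num: float, den: float) -> float:
        # Return NaN instead of raising when a score is undefined.
        return num / den if den else math.nan

    pod = ratio(tp, tp + fn)        # probability of detection (hit rate)
    far = ratio(fp, tp + fp)        # false alarm ratio
    ts = ratio(tp, tp + fp + fn)    # threat score / critical success index
    pofd = ratio(fp, fp + tn)       # probability of false detection
    tss = pod - pofd                # true skill statistic
    return {"POD": pod, "FAR": far, "TS": ts, "TSS": tss}


if __name__ == "__main__":
    # Made-up counts, purely to show the call.
    for name, value in skill_scores(tp=30, fp=10, fn=20, tn=940).items():
        print(f"{name}: {value:.3f}")
```

Because a mislabeled or missing event moves counts between the four cells, even a small number of labeling discrepancies between data products can shift all of these scores at once, especially when flares are rare.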

📝 Abstract
Machine learning models for forecasting solar flares have been trained and tested using a variety of data sources, such as Space Weather Prediction Center (SWPC) operational and science-quality data. Typically, data from these sources are minimally processed before being used to train and validate a forecasting model. However, predictive performance can be impaired if defects in and inconsistencies between these data sources are ignored. For a number of commonly used data sources, together with the software tools that query them and output processed data, we identify their respective defects and inconsistencies, quantify their extent, and show how they can affect the predictions produced by data-driven machine learning forecasting models. We also outline procedures for fixing these issues or at least mitigating their impacts. Finally, based on thorough comparisons of how the choice of data source affects the predictive skill scores of the trained forecasting model, we offer recommendations for the use of different data products in operational forecasting.
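One way to make "inconsistencies between data sources" concrete is to match flare events across two catalogs and count the disagreements. The following is a minimal sketch under stated assumptions, not the authors' procedure: each catalog is assumed to be a list of (peak time, GOES class) pairs, events are matched to the nearest unmatched event within a hypothetical 5-minute tolerance, and the function name, tolerance, and sample events are all invented for illustration.

```python
from datetime import datetime, timedelta


def compare_catalogs(cat_a, cat_b, tolerance=timedelta(minutes=5)):
    """Match events between two flare catalogs and report disagreements.

    cat_a, cat_b: lists of (peak_time: datetime, goes_class: str) tuples.
    Returns events that agree, events matched in time but labeled with a
    different class, and events present in only one catalog.
    """
    unmatched_b = list(cat_b)
    matched, class_mismatch, only_a = [], [], []
    for t_a, cls_a in cat_a:
        # Candidate counterparts in catalog B within the time tolerance.
        candidates = [e for e in unmatched_b if abs(e[0] - t_a) <= tolerance]
        if not candidates:
            only_a.append((t_a, cls_a))
            continue
        # Greedy choice: take the closest candidate in time.
        best = min(candidates, key=lambda e: abs(e[0] - t_a))
        unmatched_b.remove(best)
        if best[1] == cls_a:
            matched.append(((t_a, cls_a), best))
        else:
            class_mismatch.append(((t_a, cls_a), best))
    return {
        "matched": matched,
        "class_mismatch": class_mismatch,
        "only_a": only_a,
        "only_b": unmatched_b,
    }


if __name__ == "__main__":
    # Synthetic events, not real observations.
    a = [(datetime(2014, 1, 1, 10, 0), "M1.2"),
         (datetime(2014, 1, 1, 15, 30), "X1.0"),
         (datetime(2014, 1, 2, 3, 0), "C5.0")]
    b = [(datetime(2014, 1, 1, 10, 2), "M1.2"),
         (datetime(2014, 1, 1, 15, 31), "M9.9"),
         (datetime(2014, 1, 3, 8, 0), "C2.0")]
    for key, events in compare_catalogs(a, b).items():
        print(key, len(events))
```

The greedy nearest-in-time matching is a simplification; for dense flare sequences, a one-to-one assignment that also considers source region would likely be more robust.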
Problem

Research questions and friction points this paper is trying to address.

Identifies defects in solar flare prediction data sources
Quantifies how data inconsistencies impair machine learning forecasts
Recommends procedures to fix or mitigate data quality issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies defects in solar flare data sources
Quantifies impacts on machine learning predictions
Recommends procedures to mitigate data quality issues
Authors
Ke Hu
Department of Statistics, University of Michigan, Ann Arbor
Kevin Jin
Department of Statistics, University of Michigan, Ann Arbor
Victor Verma
Department of Statistics, University of Michigan, Ann Arbor
Weihao Liu
University of Illinois Chicago
Natural Language Processing
Ward Manchester IV
Department of Climate and Space Sciences and Engineering, University of Michigan, Ann Arbor
Lulu Zhao
Beijing University of Posts and Telecommunications
Natural Language Processing
Tamas Gombosi
Department of Climate and Space Sciences and Engineering, University of Michigan, Ann Arbor
Yang Chen
Department of Statistics, University of Michigan, Ann Arbor