Highly Imbalanced Regression with Tabular Data in SEP and Other Applications

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses regression on tabular data with extreme target imbalance (imbalance ratio greater than 1,000), motivated by forecasting the intensity of rare, harmful Solar Energetic Particle (SEP) events. It proposes CISIR, which combines three components: (1) a loss that incorporates the correlation between predicted and actual values, which the MSE loss alone does not consider; (2) a Monotonically Decreasing Involution (MDI) importance function, which is not restricted to convex shapes the way typical inverse importance functions are; and (3) stratified sampling, so that mini-batches do not miss rare instances as they can under uniform sampling. On five datasets, CISIR achieves lower error and higher correlation than some recent methods. Adding the correlation component to other recent methods can also improve their performance, and MDI importance can outperform other importance functions.

📝 Abstract
We investigate imbalanced regression with tabular data that have an imbalance ratio larger than 1,000 ("highly imbalanced"). Accurately estimating the target values of rare instances is important in applications such as forecasting the intensity of rare, harmful Solar Energetic Particle (SEP) events. For regression, the MSE loss does not consider the correlation between predicted and actual values. Typical inverse importance functions allow only convex functions. Uniform sampling might yield mini-batches that contain no rare instances. We propose CISIR, which incorporates correlation, Monotonically Decreasing Involution (MDI) importance, and stratified sampling. Based on five datasets, our experimental results indicate that CISIR can achieve lower error and higher correlation than some recent methods. Also, adding our correlation component to other recent methods can improve their performance. Lastly, MDI importance can outperform other importance functions. Our code can be found at https://github.com/Machine-Earning/CISIR.
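The abstract's point that MSE does not consider correlation can be illustrated with a small sketch. The abstract does not give CISIR's loss formula, so the combined loss below (MSE plus a weighted `1 - r` term, where `r` is the Pearson correlation and `lam` is a hypothetical weight) is an assumption for illustration, not the authors' definition.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def pearson(y_true, y_pred):
    """Pearson correlation; returns 0.0 if either input is constant."""
    a = np.asarray(y_true, dtype=float)
    b = np.asarray(y_pred, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def mse_plus_correlation(y_true, y_pred, lam=0.5):
    """Hypothetical combined loss: MSE + lam * (1 - Pearson r).

    This is an illustrative assumption, not the loss defined in the paper.
    """
    return mse(y_true, y_pred) + lam * (1.0 - pearson(y_true, y_pred))

y = [0.0, 1.0, 2.0, 3.0]
shifted = [1.0, 2.0, 3.0, 4.0]    # every prediction is off by +1, order preserved
scrambled = [1.0, 0.0, 3.0, 2.0]  # every prediction is off by 1, order disturbed

# Both prediction sets have an MSE of 1.0, so MSE alone cannot separate them.
print(mse(y, shifted), mse(y, scrambled))
# The correlation term distinguishes them: the scrambled set is penalized more.
print(mse_plus_correlation(y, shifted), mse_plus_correlation(y, scrambled))
```

In this toy case the two prediction sets are tied under MSE, while the added correlation term prefers the one that preserves the ordering of the targets.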
Problem

Research questions and friction points this paper is trying to address.

Addresses highly imbalanced regression with tabular data
Accurately predicts target values for rare instances
Overcomes limitations of MSE loss and uniform sampling
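The uniform-sampling limitation listed above can be made concrete with a rough sketch. It assumes, for illustration only, that an imbalance ratio of 1,000 means about one rare instance per 1,000 frequent ones (the page does not define the ratio). The equal-width binning in `stratified_batch` is likewise a hypothetical stand-in, not the stratification scheme used by CISIR.

```python
import numpy as np

def prob_batch_has_no_rare(rare_fraction, batch_size):
    """Chance that a uniformly drawn mini-batch contains no rare instance."""
    return (1.0 - rare_fraction) ** batch_size

def stratified_batch(targets, n_bins, batch_size, rng):
    """Draw one batch of indices with an equal share from every non-empty bin.

    Bins are equal-width over the target range (an assumption). Sampling is
    with replacement, so small bins are oversampled. The batch may be slightly
    smaller than batch_size when it is not divisible by the number of bins.
    """
    targets = np.asarray(targets, dtype=float)
    edges = np.linspace(targets.min(), targets.max(), n_bins + 1)
    # Interior edges map every target to a bin id in 0..n_bins-1.
    bin_ids = np.digitize(targets, edges[1:-1])
    groups = [np.flatnonzero(bin_ids == b) for b in range(n_bins)]
    groups = [g for g in groups if g.size > 0]
    per_bin = max(1, batch_size // len(groups))
    picks = [rng.choice(g, size=per_bin, replace=True) for g in groups]
    return np.concatenate(picks)

# With 1 rare instance per 1,000 frequent ones and a batch size of 32,
# most uniformly sampled batches contain no rare instance at all.
p_miss = prob_batch_has_no_rare(rare_fraction=1 / 1001, batch_size=32)
print(f"P(no rare instance in a uniform batch) = {p_miss:.3f}")

# Toy data: 1,000 frequent targets in [0, 1) and one rare target at 10.
rng = np.random.default_rng(0)
targets = np.concatenate([rng.uniform(0.0, 1.0, 1000), [10.0]])
batch = stratified_batch(targets, n_bins=2, batch_size=32, rng=rng)
print("rare instance in stratified batch:", 1000 in batch)
```

Because each non-empty bin contributes to every batch, the rare instance (index 1000 in the toy data) is always present, at the cost of repeating it within the batch.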
Innovation

Methods, ideas, or system contributions that make the work stand out.

Correlation-incorporated loss function for regression
Monotonically Decreasing Involution importance function
Stratified sampling to handle rare instances
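To show what a monotonically decreasing involution looks like, the sketch below uses one well-known family, f(x) = (1 - x) / (1 + alpha * x) on [0, 1] with alpha > -1. The page does not give the formula of the paper's MDI importance function, so this family, the parameter name `alpha`, and the idea of feeding it a normalized target density are all assumptions for illustration.

```python
import numpy as np

def involution_importance(density, alpha=1.0):
    """Hypothetical importance: f(x) = (1 - x) / (1 + alpha * x), x in [0, 1].

    For alpha > -1 this map is strictly decreasing, sends 0 to 1 and 1 to 0,
    and is its own inverse: f(f(x)) == x. It is convex for alpha > 0,
    linear for alpha == 0, and concave for -1 < alpha < 0.
    Not taken from the paper; an illustrative example only.
    """
    if alpha <= -1.0:
        raise ValueError("alpha must be greater than -1")
    x = np.asarray(density, dtype=float)
    return (1.0 - x) / (1.0 + alpha * x)

# Normalized densities: 0 stands for the rarest targets, 1 for the most common.
x = np.linspace(0.0, 1.0, 11)

for alpha in (-0.5, 0.0, 3.0):
    w = involution_importance(x, alpha)
    back = involution_importance(w, alpha)
    print(
        f"alpha={alpha:+.1f}",
        "decreasing:", bool(np.all(np.diff(w) < 0)),
        "involution:", bool(np.allclose(back, x)),
    )
```

Under these assumptions, rare targets (low density) receive weights near 1 and common targets receive weights near 0, and the sign of `alpha` switches the curve between convex and concave shapes.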
Authors
Josias K. Moukpe
Department of Electrical Engineering and Computer Science, Florida Institute of Technology, Melbourne, FL, USA
Philip K. Chan
Florida Institute of Technology
Machine Learning, Data Mining
Ming Zhang
Department of Aerospace, Physics and Space Sciences, Florida Institute of Technology, Melbourne, FL, USA