🤖 AI Summary
Class imbalance severely biases model training and undermines evaluation validity across domains. Method: This paper systematically reviews 258 authoritative publications (2003–2023) and proposes the first cross-domain, multi-dimensional taxonomy for imbalanced learning, unifying sampling-based methods (e.g., SMOTE, ADASYN), cost-sensitive learning, ensemble techniques (e.g., EasyEnsemble, RUSBoost), deep learning adaptations, and evaluation metrics (F1, G-mean, AUC-PR). Contribution/Results: The paper constructs a full-stack knowledge graph spanning preprocessing, modeling, evaluation, and deployment, and introduces the first guideline for selecting evaluation metrics in large-scale, real-world imbalanced applications. The framework significantly lowers practical adoption barriers in high-skew domains such as financial risk control and medical diagnosis, while identifying emerging research frontiers, including integration with self-supervised and causal learning.
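To make two of the components the summary names concrete, the sketch below shows a minimal SMOTE-style oversampler and the G-mean metric in plain NumPy. This is an illustrative toy implementation, not the paper's method and not the imbalanced-learn library API; the function names `smote_sketch` and `g_mean` and their parameters are assumptions for this example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Toy SMOTE-style oversampling (illustrative, not a library API):
    create a synthetic point by interpolating between a random minority
    sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from sample i to all minority samples.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]  # skip index 0: the point itself
        j = rng.choice(nbrs)
        lam = rng.random()             # interpolation factor in [0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synth)

def g_mean(y_true, y_pred):
    """G-mean for binary 0/1 labels: sqrt(sensitivity * specificity).
    Unlike accuracy, it collapses to 0 when either class is ignored."""
    sens = np.sum((y_true == 1) & (y_pred == 1)) / max(np.sum(y_true == 1), 1)
    spec = np.sum((y_true == 0) & (y_pred == 0)) / max(np.sum(y_true == 0), 1)
    return np.sqrt(sens * spec)

# Demo: oversample a tiny minority class, then score a skewed prediction.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(smote_sketch(X_min, n_new=3, k=2, rng=0))
print(g_mean(np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0])))
```

Production code would instead use a maintained implementation (e.g., the imbalanced-learn package); the point here is only that synthetic samples are convex combinations of minority neighbors, and that G-mean rewards balanced performance on both classes.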
📝 Abstract
For over two decades, detecting rare events has been a challenging task for researchers in the data mining and machine learning communities. Real-world problems continue to motivate improvements in both data processing and algorithmic approaches, with the goal of effective and computationally efficient methods for imbalanced learning. In this paper, we collect and review 258 peer-reviewed publications from archival journals and conference proceedings to provide an in-depth review of imbalanced learning from technical and application perspectives. This work offers a structured review of methods for addressing imbalanced data across domains, along with a general guideline for researchers in academia or industry who want to enter the broad field of machine learning with large-scale imbalanced data.