🤖 AI Summary
Class imbalance severely biases model training and undermines evaluation validity across domains. Method: This paper systematically reviews 258 authoritative publications (2003–2023) and proposes the first cross-domain, multi-dimensional taxonomy for imbalanced learning, unifying sampling-based methods (e.g., SMOTE, ADASYN), cost-sensitive learning, ensemble techniques (e.g., EasyEnsemble, RUSBoost), deep learning adaptations, and evaluation metrics (F1, G-mean, AUC-PR). Contribution/Results: The paper constructs a full-stack knowledge graph spanning preprocessing, modeling, evaluation, and deployment, and introduces the first guideline for selecting evaluation metrics in large-scale, real-world imbalanced applications. The framework significantly lowers practical adoption barriers in high-skew domains such as financial risk control and medical diagnosis, while identifying emerging research frontiers, including integration with self-supervised and causal learning.
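To make two of the components the summary names concrete, the sketch below shows a minimal SMOTE-style oversampler and the G-mean metric in plain NumPy. This is an illustrative toy implementation, not the paper's method and not the imbalanced-learn library API; the function names `smote_sketch` and `g_mean` and their parameters are assumptions for this example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Toy SMOTE-style oversampling (illustrative, not a library API):
    create a synthetic point by interpolating between a random minority
    sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from sample i to all minority samples.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]  # skip index 0: the point itself
        j = rng.choice(nbrs)
        lam = rng.random()             # interpolation factor in [0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synth)

def g_mean(y_true, y_pred):
    """G-mean for binary 0/1 labels: sqrt(sensitivity * specificity).
    Unlike accuracy, it collapses to 0 when either class is ignored."""
    sens = np.sum((y_true == 1) & (y_pred == 1)) / max(np.sum(y_true == 1), 1)
    spec = np.sum((y_true == 0) & (y_pred == 0)) / max(np.sum(y_true == 0), 1)
    return np.sqrt(sens * spec)

# Demo: oversample a tiny minority class, then score a skewed prediction.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(smote_sketch(X_min, n_new=3, k=2, rng=0))
print(g_mean(np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0])))
```

Production code would instead use a maintained implementation (e.g., the imbalanced-learn package); the point here is only that synthetic samples are convex combinations of minority neighbors, and that G-mean rewards balanced performance on both classes.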
📝 Abstract
For over two decades, detecting rare events has been a challenging task for researchers in the data mining and machine learning communities. Real-world problems continue to motivate improvements in both data processing and algorithmic approaches, with the goal of effective and computationally efficient methods for imbalanced learning. In this paper, we collect and review 258 peer-reviewed publications from archival journals and conference proceedings to provide an in-depth review of imbalanced learning from technical and application perspectives. This work offers a structured review of methods for addressing imbalanced data across domains, along with a general guideline for researchers in academia or industry who want to enter the broad field of machine learning with large-scale imbalanced data.