AI Summary
Linear classifiers (e.g., SVM) suffer from degraded generalization performance on high-dimensional imbalanced data.
Method: We establish a high-dimensional asymptotic theoretical framework and, for the first time, rigorously derive analytical expressions for the generalization error under undersampling and oversampling. Our approach integrates random matrix theory, high-dimensional statistical learning, and resampling driven by unsupervised probabilistic modeling.
Contribution/Results: We quantify how resampling efficacy depends on the first- and second-order statistics of the data and the choice of evaluation metric. Crucially, we prove, and empirically verify, that hybrid sampling consistently outperforms either undersampling or oversampling alone. Extensive numerical experiments and evaluations on real-world datasets, including deep neural network features, demonstrate strong agreement between theoretical predictions and empirical results, with substantial improvements in minority-class classification accuracy. This work provides an interpretable, generalizable, and principled foundation for data rebalancing in high dimensions.
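As a rough illustration of the hybrid strategy discussed above, the sketch below rebalances a toy two-class dataset by simultaneously undersampling the majority class and oversampling (duplicating) the minority class until both reach a common target size. The Gaussian toy data, the dimensions, and the `target` size are all assumptions for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: two Gaussian classes in d dimensions
# (illustrative placeholder, not the paper's data).
d = 20
n_major, n_minor = 1000, 100
X_major = rng.normal(loc=0.0, size=(n_major, d))
X_minor = rng.normal(loc=0.5, size=(n_minor, d))

def hybrid_sample(X_major, X_minor, target, rng):
    """Undersample the majority class (without replacement) and
    oversample the minority class (with replacement) so that both
    classes end up with `target` examples."""
    keep = rng.choice(len(X_major), size=target, replace=False)
    dup = rng.choice(len(X_minor), size=target, replace=True)
    return X_major[keep], X_minor[dup]

# Meet in the middle: fewer majority points, duplicated minority points.
target = 400
Xa, Xb = hybrid_sample(X_major, X_minor, target, rng)
print(Xa.shape, Xb.shape)  # (400, 20) (400, 20)
```

The resulting balanced set would then be fed to a linear classifier such as an SVM; the paper's theory predicts how the choice of `target` (i.e., how much to undersample versus oversample) interacts with the data statistics and the evaluation metric.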
Abstract
Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under- or oversampling the data depending on class abundance, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we derive exact analytical expressions for the generalization curves of linear classifiers (Support Vector Machines) in the high-dimensional regime. We also provide a sharp prediction of the effects of under/oversampling strategies as a function of the class imbalance, the first and second moments of the data, and the performance metric considered. We show that mixed strategies combining undersampling and oversampling improve performance. Through numerical experiments, we demonstrate the relevance of our theoretical predictions on real datasets, on deeper architectures, and with sampling strategies based on unsupervised probabilistic models.
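The abstract's last point, sampling strategies based on unsupervised probabilistic models, can be sketched minimally by fitting a single Gaussian to the minority class and drawing synthetic points from it, rather than merely duplicating existing examples. The toy data, the dimensionality, the number of synthetic points, and the diagonal regularizer on the covariance are all illustrative assumptions, not the paper's specific model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy minority-class sample (illustrative; the paper's datasets differ).
X_minor = rng.normal(loc=0.5, scale=1.0, size=(80, 10))

# Fit a single Gaussian to the minority class (an unsupervised
# probabilistic model) and sample synthetic points from it.
mu = X_minor.mean(axis=0)
# Small diagonal term keeps the estimated covariance well-conditioned.
cov = np.cov(X_minor, rowvar=False) + 1e-6 * np.eye(X_minor.shape[1])
n_new = 220  # bring the minority class up to 300 points
X_synth = rng.multivariate_normal(mu, cov, size=n_new)

X_minor_balanced = np.vstack([X_minor, X_synth])
print(X_minor_balanced.shape)  # (300, 10)
```

Sampling from a fitted model, instead of duplicating points, injects fresh variability into the minority class; the paper's framework makes it possible to predict when this helps, depending on the data's first and second moments.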