Harnessing Mixed Features for Imbalance Data Oversampling: Application to Bank Customers Scoring

📅 2025-03-26
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
This work addresses binary rare-event detection in tabular data with mixed (continuous and categorical) features and severe class imbalance, as in bank customer scoring. Method: The authors propose MGS-GRF, an oversampling strategy designed for mixed features. It generates continuous features with a kernel density estimator that uses locally estimated full-rank covariances, and it draws categorical features from the original samples through a generalized random forest. Contribution/Results: Unlike SMOTE-NC, MGS-GRF empirically exhibits two properties: (1) coherence, meaning it only generates combinations of categorical values already present in the original data set; and (2) association, meaning it preserves the dependence between continuous and categorical features. On simulated data, public real-world data sets, and a private data set from a leading financial institution, LightGBM classifiers trained on data augmented by methods with these two properties achieve better PR AUC and ROC AUC, with MGS-GRF performing best. The method also shows promising results for the private banking application, with a development pipeline that complies with regulatory constraints.

📝 Abstract
This study investigates rare event detection on tabular data within binary classification. Standard techniques to handle class imbalance include SMOTE, which generates synthetic samples from the minority class. However, SMOTE is intrinsically designed for continuous input variables. In fact, apart from SMOTE-NC, its default extension for mixed features (continuous and categorical variables), very few works propose procedures to synthesize mixed features. Yet many real-world classification tasks, such as those in the banking sector, involve mixed features, which have a significant impact on predictive performance. To this end, we introduce MGS-GRF, an oversampling strategy designed for mixed features. This method uses a kernel density estimator with locally estimated full-rank covariances to generate continuous features, while categorical ones are drawn from the original samples through a generalized random forest. Empirically, contrary to SMOTE-NC, we show that MGS-GRF exhibits two important properties: (i) coherence, i.e., the ability to generate only combinations of categorical features that are already present in the original data set, and (ii) association, i.e., the ability to preserve the dependence between continuous and categorical features. We also evaluate the predictive performance of LightGBM classifiers trained on data sets augmented with synthetic samples from various strategies. Our comparison is performed on simulated and public real-world data sets, as well as on a private data set from a leading financial institution. We observe that synthetic procedures that have the properties of coherence and association display better predictive performance on several metrics (such as PR AUC and ROC AUC), with MGS-GRF being the best one. Furthermore, our method exhibits promising results for the private banking application, with a development pipeline that complies with regulatory constraints.
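The two properties named in the abstract can be made concrete with a short sketch. The code below is illustrative only and is not taken from the paper: `coherence_rate` and `correlation_ratio` are hypothetical helper names, and the correlation ratio (eta squared) is just one common statistic for the dependence between a categorical and a continuous feature, not necessarily the measure the authors use.

```python
from collections import defaultdict


def coherence_rate(original_rows, synthetic_rows):
    """Share of synthetic categorical tuples that already occur in the original data.

    A value of 1.0 means the generator only produced observed combinations.
    (Hypothetical helper, not the authors' evaluation code.)
    """
    seen = {tuple(row) for row in original_rows}
    synthetic = [tuple(row) for row in synthetic_rows]
    if not synthetic:
        raise ValueError("synthetic_rows is empty")
    return sum(row in seen for row in synthetic) / len(synthetic)


def correlation_ratio(categories, values):
    """Eta squared: between-category variance of `values` over total variance.

    Comparing this statistic on original and synthetic data gives a rough
    indication of whether the categorical/continuous dependence was preserved.
    (Assumed proxy for "association", chosen for illustration.)
    """
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(float(val))
    all_vals = [v for group in groups.values() for v in group]
    if not all_vals:
        raise ValueError("no values given")
    grand_mean = sum(all_vals) / len(all_vals)
    total = sum((v - grand_mean) ** 2 for v in all_vals)
    if total == 0.0:
        return 0.0  # constant feature: no variance to explain
    between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups.values()
    )
    return between / total


# Toy data (made up for this example).
original = [("student", "renter"), ("retired", "owner")]
synthetic = [("student", "renter"), ("student", "owner")]
print(coherence_rate(original, synthetic))  # 0.5: one of two tuples was never observed
```

A generator that copies whole categorical tuples from original samples scores 1.0 on `coherence_rate` by construction, whereas one that samples each categorical column independently can score lower.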
Problem

Research questions and friction points this paper is trying to address.

Detecting rare events in imbalanced binary classification data
Handling mixed features (continuous and categorical) in oversampling
Improving predictive performance in banking customer scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

MGS-GRF oversampling for mixed features
Kernel density estimator for continuous features
Generalized random forest for categorical features
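The general shape of such a mixed-feature oversampler can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `oversample_mixed`, the parameters `k` and `ridge`, and the choice of a Gaussian centered on the local neighborhood mean are all assumptions, and a plain nearest-neighbor lookup stands in for the generalized random forest when assigning categorical values.

```python
import numpy as np


def oversample_mixed(X_cont, X_cat, n_new, k=5, ridge=1e-6, seed=0):
    """Generate n_new synthetic minority samples with mixed features.

    X_cont: (n, d) float array of continuous minority features.
    X_cat:  (n, c) array of categorical minority features.
    Continuous part: Gaussian draw with a covariance estimated on a local
    neighborhood (assumed form of the local kernel).
    Categorical part: the tuple of the nearest original sample is copied
    (stand-in for the generalized random forest described in the paper).
    """
    rng = np.random.default_rng(seed)
    n, d = X_cont.shape
    new_cont = np.empty((n_new, d))
    new_cat = np.empty((n_new, X_cat.shape[1]), dtype=X_cat.dtype)
    for i in range(n_new):
        # Pick a random minority point and its k nearest minority neighbors.
        center = X_cont[rng.integers(n)]
        dists = np.linalg.norm(X_cont - center, axis=1)
        neigh = X_cont[np.argsort(dists)[: k + 1]]  # the center itself plus k neighbors
        # Local mean and covariance; the small ridge keeps the matrix full rank.
        mu = neigh.mean(axis=0)
        cov = np.atleast_2d(np.cov(neigh, rowvar=False)) + ridge * np.eye(d)
        new_cont[i] = rng.multivariate_normal(mu, cov)
        # Copy the whole categorical tuple of the closest original sample,
        # so only observed category combinations can be produced.
        nearest = np.argmin(np.linalg.norm(X_cont - new_cont[i], axis=1))
        new_cat[i] = X_cat[nearest]
    return new_cont, new_cat
```

Because each synthetic categorical tuple is copied whole from an original row, this sketch is coherent by construction. Tying the tuple to the nearest continuous neighbor is only a crude way to keep the continuous and categorical parts related; the paper's forest-based draw is the method to consult for the actual procedure.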
Abdoulaye Sakho
Artefact Research Center, Paris, France
Emmanuel Malherbe
Artefact Research Center, Paris, France
Carl-Erik Gauthier
Société Générale, Paris, France
Erwan Scornet
Professor, Sorbonne Université
Statistics, Machine Learning