🤖 AI Summary
To address the limited effectiveness of oversampling minority classes in imbalanced classification, this paper proposes Simplicial SMOTE—the first method to incorporate simplicial complex structures from topological data analysis into oversampling. Unlike conventional SMOTE, which generates synthetic samples solely via convex combinations of pairs (i.e., edges), Simplicial SMOTE constructs a geometric simplicial complex from k-nearest neighbors and synthesizes samples within higher-dimensional simplices using barycentric coordinates. This enables more accurate coverage of the underlying data manifold and decision boundary regions, thereby enhancing boundary representation. Furthermore, we generalize prominent strategies—including Borderline-SMOTE, Safe-Level-SMOTE, and ADASYN—into their simplicial counterparts, establishing a unified framework for simplex-based oversampling. Extensive experiments across multiple benchmark datasets demonstrate that Simplicial SMOTE and its variants consistently outperform standard SMOTE and various graph-based baselines, validating both efficacy and generalizability.
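For contrast with the simplicial variant described above, here is a minimal sketch of the classic SMOTE step the summary refers to: each synthetic point is a convex combination of a minority-class point and one of its k nearest minority neighbors. The function name `smote_sample` and its parameters are illustrative, not from the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Classic SMOTE sketch: interpolate along the edge between a
    minority point and a randomly chosen one of its k nearest
    minority neighbors (illustrative, not the paper's code)."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)  # idx[:, 0] is the point itself
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # a random neighbor of point i
        lam = rng.random()                  # interpolation weight in [0, 1]
        samples.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(samples)
```

Because every synthetic point lies on an edge between two existing points, the generated set stays inside the convex hull of the minority class, which is exactly the restriction Simplicial SMOTE relaxes by sampling from higher-dimensional simplices.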
📝 Abstract
SMOTE (Synthetic Minority Oversampling Technique) is the established geometric approach to random oversampling for balancing classes in the imbalanced learning problem, and it has been followed by many extensions. Its idea is to introduce synthetic data points of the minority class, with each new point being a convex combination of an existing data point and one of its k-nearest neighbors. In this paper, by viewing SMOTE as sampling from the edges of a geometric neighborhood graph and borrowing tools from topological data analysis, we propose a novel technique, Simplicial SMOTE, that samples from the simplices of a geometric neighborhood simplicial complex. A new synthetic point is defined by its barycentric coordinates w.r.t. a simplex spanned by an arbitrary number of sufficiently close data points, rather than just a pair. This replacement of the geometric data model yields better coverage of the underlying data distribution than existing geometric sampling methods and allows synthetic minority-class points to be generated closer to the majority class, on the decision boundary. We experimentally demonstrate that our Simplicial SMOTE outperforms several popular geometric sampling methods, including the original SMOTE. Moreover, we show that simplicial sampling can be easily integrated into existing SMOTE extensions. We generalize and evaluate simplicial extensions of the classic Borderline SMOTE, Safe-level SMOTE, and ADASYN algorithms, all of which outperform their graph-based counterparts.
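The sampling step described in the abstract can be sketched as follows: form a simplex from a minority point and several of its k nearest minority neighbors, then draw a synthetic point by random barycentric coordinates over the simplex (a uniform draw corresponds to Dirichlet(1, ..., 1) weights). This is a simplified illustration under stated assumptions, not the paper's implementation; in particular, the paper builds a geometric simplicial complex, whereas this sketch just picks `dim` neighbors at random. The name `simplicial_smote_sample` and all parameters are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def simplicial_smote_sample(X_min, k=5, dim=2, n_new=100, rng=None):
    """Illustrative simplex-based oversampling: sample a point inside
    a simplex spanned by a minority point and `dim` of its k nearest
    minority neighbors, using random barycentric coordinates."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)  # idx[:, 0] is the point itself
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        nbrs = rng.choice(idx[i, 1:], size=dim, replace=False)
        verts = np.vstack([X_min[i][None, :], X_min[nbrs]])  # (dim+1, d) vertices
        w = rng.dirichlet(np.ones(dim + 1))  # barycentric coordinates, sum to 1
        samples.append(w @ verts)            # convex combination of the vertices
    return np.array(samples)
```

Setting `dim=1` recovers the pairwise interpolation of classic SMOTE, which makes concrete the abstract's point that edge sampling is the one-dimensional special case of simplex sampling.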