🤖 AI Summary
This work addresses a problem in choosing the number of clusters in unsupervised learning: under imbalanced cluster sizes, the standard micro-averaged Silhouette coefficient biases the estimate toward large clusters, while macro-averaging can overemphasize noise from small, under-represented clusters. The authors propose Composite Silhouette, a criterion that aggregates evidence across repeated subsampled clusterings instead of scoring a single partition. For each subsample, the micro- and macro-averaged Silhouette scores are merged through a convex combination whose adaptive weight depends on their normalized discrepancy and is smoothed by a bounded nonlinearity; the subsample-level composites are then averaged. The authors also derive finite-sample concentration guarantees for the subsampling estimate. In the reported experiments on synthetic and real-world datasets, the method recovers the number of clusters more accurately than existing internal validation criteria.
📝 Abstract
Determining the number of clusters is a central challenge in unsupervised learning, where ground-truth labels are unavailable. The Silhouette coefficient is a widely used internal validation metric for this task, yet its standard micro-averaged form tends to favor larger clusters under size imbalance. Macro-averaging mitigates this bias by weighting clusters equally, but may overemphasize noise from under-represented groups. We introduce Composite Silhouette, an internal criterion for cluster-count selection that aggregates evidence across repeated subsampled clusterings rather than relying on a single partition. For each subsample, micro- and macro-averaged Silhouette scores are combined through an adaptive convex weight determined by their normalized discrepancy and smoothed by a bounded nonlinearity; the final score is then obtained by averaging these subsample-level composites. We establish key properties of the criterion and derive finite-sample concentration guarantees for its subsampling estimate. Experiments on synthetic and real-world datasets show that Composite Silhouette effectively reconciles the strengths of micro- and macro-averaging, yielding more accurate recovery of the ground-truth number of clusters.
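The aggregation described in the abstract can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact weight formula, so the normalized discrepancy `|macro - micro| / (|micro| + |macro|)`, the choice of `tanh` as the bounded nonlinearity, and the decision to put the weight on the macro score are all assumptions. The names `composite_silhouette`, `micro_macro`, and `cluster_fn` are hypothetical.

```python
import math
import random
from collections import defaultdict


def silhouette_samples(points, labels):
    """Per-point Silhouette s(i) = (b - a) / max(a, b) with Euclidean distance."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    if len(clusters) < 2:
        raise ValueError("Silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def micro_macro(points, labels):
    """Micro average (over points) and macro average (over cluster means)."""
    s = silhouette_samples(points, labels)
    micro = sum(s) / len(s)
    per_cluster = defaultdict(list)
    for value, lab in zip(s, labels):
        per_cluster[lab].append(value)
    macro = sum(sum(v) / len(v) for v in per_cluster.values()) / len(per_cluster)
    return micro, macro


def composite_silhouette(points, cluster_fn, n_subsamples=20, fraction=0.8, seed=0):
    """Average of per-subsample convex combinations of micro and macro scores.

    cluster_fn (hypothetical) maps a list of points to a list of labels.
    """
    rng = random.Random(seed)
    m = max(2, round(fraction * len(points)))
    composites = []
    for _ in range(n_subsamples):
        sample = rng.sample(points, m)
        labels = cluster_fn(sample)
        if len(set(labels)) < 2:
            continue  # Silhouette is undefined for a single cluster
        micro, macro = micro_macro(sample, labels)
        # ASSUMPTION: normalized discrepancy in [0, 1]; the paper may differ.
        disc = abs(macro - micro) / (abs(micro) + abs(macro) + 1e-12)
        # ASSUMPTION: tanh as the bounded smoothing, weight applied to macro.
        w = math.tanh(disc)
        composites.append((1.0 - w) * micro + w * macro)
    if not composites:
        raise ValueError("no subsample produced at least two clusters")
    return sum(composites) / len(composites)
```

To select a cluster count with this sketch, one would evaluate `composite_silhouette` once per candidate count, passing a `cluster_fn` that runs the chosen clustering algorithm with that count, and keep the candidate with the highest score. Because each per-subsample value is a convex combination of two scores in [-1, 1], the result stays in [-1, 1] regardless of the weight formula.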