🤖 AI Summary
Standard k-means is prone to biased cluster centroids when the data contain outliers or imbalanced clusters, which degrades clustering quality and separation. To address this, the authors propose K-Sil, a silhouette-guided, adaptive instance-weighted k-means algorithm that assigns higher weights to samples with high intra-cluster cohesion and strong separation from other clusters, thereby reducing the influence of outliers and of imbalance-induced bias. The key contributions are: (i) a silhouette-driven, self-tuning weighting mechanism that supports macro-averaged, micro-averaged, or combined silhouette objectives; (ii) a theoretical guarantee of centroid convergence; and (iii) efficiency gains from sampling strategies and scalable approximations. Experiments on synthetic and real-world datasets show statistically significant improvements in silhouette scores over standard k-means and two other instance-weighted k-means variants.
📝 Abstract
Clustering is a fundamental unsupervised learning task with numerous applications across diverse fields. Popular algorithms such as k-means often struggle with outliers or imbalanced clusters, leading to distorted centroids and suboptimal partitions. We introduce K-Sil, a silhouette-guided refinement of the k-means algorithm that weights points based on their silhouette scores, prioritizing well-clustered instances while suppressing borderline or noisy regions. The algorithm optimizes a user-specified silhouette aggregation metric (macro-averaged, micro-averaged, or a combination of the two) through self-tuning weighting schemes, supported by appropriate sampling strategies and scalable approximations. These components ensure computational efficiency and adaptability to diverse dataset geometries. Theoretical guarantees establish centroid convergence, and empirical validation on synthetic and real-world datasets demonstrates statistically significant improvements in silhouette scores over k-means and two other instance-weighted k-means variants. These results establish K-Sil as a principled alternative for applications demanding high-quality, well-separated clusters.
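To make the idea of silhouette-based point weighting concrete, the following is a minimal Python sketch and not the authors' K-Sil implementation. It assumes an exponential weight `exp(s / temperature)` on each point's silhouette score `s`, an exact O(n²) silhouette computation, and random centroid initialization; the abstract does not specify the weighting scheme, sampling strategy, or approximations, so the function names (`silhouette_weighted_kmeans`, `micro_silhouette`, `macro_silhouette`) and the `temperature` parameter are hypothetical choices made for illustration. The two helper functions at the end show the usual distinction between the micro-averaged silhouette (mean over all points) and the macro-averaged silhouette (mean of per-cluster means).

```python
import numpy as np


def silhouette_samples(X, labels):
    """Exact per-point silhouette s(i) = (b - a) / max(a, b); O(n^2) time and memory."""
    n = X.shape[0]
    # Full pairwise Euclidean distance matrix.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(n)
    if clusters.size < 2:
        return s  # silhouette is undefined with a single cluster; return zeros
    for i in range(n):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # common convention: a singleton cluster gets s = 0
        a = D[i, own].sum() / (n_own - 1)  # mean distance to own cluster (excluding i)
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s


def silhouette_weighted_kmeans(X, k, n_iter=20, temperature=1.0, seed=0):
    """Lloyd-style iterations whose centroid update is a silhouette-weighted mean.

    The weight exp(s / temperature) is an illustrative assumption, not the
    weighting scheme from the paper.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        s = silhouette_samples(X, labels)
        w = np.exp(s / temperature)  # well-clustered points (high s) weigh more
        new_centroids = centroids.copy()
        for j in range(k):
            members = labels == j
            if members.any():  # keep the old centroid if a cluster is empty
                new_centroids[j] = np.average(X[members], axis=0, weights=w[members])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Recompute labels and silhouettes so they match the returned centroids.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    return centroids, labels, silhouette_samples(X, labels)


def micro_silhouette(s):
    """Micro-average: mean silhouette over all points (large clusters dominate)."""
    return float(np.mean(s))


def macro_silhouette(s, labels):
    """Macro-average: mean of per-cluster mean silhouettes (clusters count equally)."""
    return float(np.mean([s[labels == c].mean() for c in np.unique(labels)]))
```

Calling `silhouette_weighted_kmeans(X, k=2)` on an `(n, d)` array returns the centroids, the hard assignments, and the per-point silhouette scores, which can then be passed to `micro_silhouette` or `macro_silhouette`. On imbalanced data the two averages can diverge noticeably, which is presumably why the abstract lets the user choose which one to target.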