Provable Imbalanced Point Clustering

📅 2024-08-26

🏛️ International Conference on Cyber Security Cryptography and Machine Learning

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the $k$-center clustering problem in high-dimensional spaces under severe class imbalance. Existing methods suffer from theoretical gaps and computational inefficiency. To bridge this gap, we propose the first coreset construction method with provable error bounds and introduce Choice Clustering, a novel framework integrating geometric approximation, weighted sampling, and combinatorial optimization. Our approach constructs a lightweight, weighted coreset that drastically reduces data size while preserving a $(1+\varepsilon)$-approximation to the original clustering objective. Extensive experiments on real-world image datasets, synthetic benchmarks, and imbalanced real-world data demonstrate that our method achieves an average 12.7% improvement in clustering accuracy, runs 3.8× faster than state-of-the-art baselines, and exhibits strong robustness to class skew. The framework thus establishes a new paradigm for large-scale imbalanced clustering, combining rigorous theoretical guarantees with practical efficiency.

📝 Abstract

We suggest efficient and provable methods to compute an approximation for imbalanced point clustering, that is, fitting $k$-centers to a set of points in $\mathbb{R}^d$, for any $d,k\geq 1$. To this end, we utilize coresets, which, in the context of the paper, are essentially weighted sets of points in $\mathbb{R}^d$ that approximate the fitting loss for every model in a given set, up to a multiplicative factor of $1\pm\varepsilon$. We provide [Section 3 and Section E in the appendix] experiments that show the empirical contribution of our suggested methods for real images (novel and reference), synthetic data, and real-world data. We also propose choice clustering, which by combining clustering algorithms yields better performance than each one separately.
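
The coreset guarantee described in the abstract can be written out as a single inequality. The notation here is an assumption made for illustration and is not taken from the paper: $P$ is the input point set, $(C, w)$ is the weighted coreset, $Q$ is any candidate set of $k$ centers, and $\ell(p, Q)$ is the fitting loss of a point $p$ with respect to $Q$.

$$
\left| \sum_{p \in P} \ell(p, Q) - \sum_{c \in C} w(c)\,\ell(c, Q) \right| \le \varepsilon \sum_{p \in P} \ell(p, Q) \quad \text{for every candidate set } Q \text{ of } k \text{ centers}.
$$

Equivalently, the weighted loss on $C$ lies within a factor of $1\pm\varepsilon$ of the loss on $P$. Because the bound holds for every $Q$ at once, a solver can in principle be run on the small set $C$ in place of $P$.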

Problem

Research questions and friction points this paper is trying to address.

Efficient approximation for imbalanced point clustering

Utilizing coresets to approximate fitting loss

Improving clustering performance via choice clustering
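
To make "approximating the fitting loss with a weighted set" concrete, here is a minimal sketch that compares the loss on a full point set with the loss on a small weighted subset. It rests on two assumptions that are not from the paper: the loss is taken to be the sum of distances to the nearest center, and the subset is a plain uniform sample (weights $n/m$), which stands in for the paper's coreset construction and carries none of its guarantees. All function names are hypothetical.

```python
import math
import random


def clustering_loss(points, centers, weights=None):
    """Weighted sum of distances from each point to its nearest center.

    Assumed objective for illustration; the paper's loss may differ.
    """
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def uniform_weighted_sample(points, m, rng):
    """Pick m points uniformly and weight each by n/m (total weight is n).

    This is NOT the paper's coreset; it is the simplest weighted subset.
    """
    sample = rng.sample(points, m)
    return sample, [len(points) / m] * m


rng = random.Random(0)
# Imbalanced toy data: one large cluster and one small cluster.
big = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(950)]
small = [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(50)]
points = big + small

subset, weights = uniform_weighted_sample(points, 100, rng)
centers = [(0.0, 0.0), (8.0, 8.0)]  # one candidate model Q

full = clustering_loss(points, centers)
approx = clustering_loss(subset, centers, weights)
relative_error = abs(full - approx) / full
print(f"full={full:.2f} approx={approx:.2f} relative error={relative_error:.3f}")
```

A uniform sample can miss the small cluster entirely, which is one reason imbalanced data calls for a more careful construction than this one.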

Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes coresets for efficient clustering approximation

Combines clustering algorithms for improved performance

Validates methods with real and synthetic data experiments
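
The idea of combining clustering algorithms can be sketched as follows. This is one plausible reading of choice clustering, not the paper's procedure: run several candidate algorithms on the same data and keep whichever result has the lowest loss. The two candidate algorithms, the loss (sum of distances to the nearest center), and all names are assumptions chosen for illustration.

```python
import math
import random


def loss(points, centers):
    """Sum of distances from each point to its nearest center (assumed objective)."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)


def random_centers(points, k, rng):
    """Baseline candidate: k input points chosen uniformly at random."""
    return rng.sample(points, k)


def farthest_point_centers(points, k, rng):
    """Greedy candidate: repeatedly add the point farthest from the current centers."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def choose_best(points, k, algorithms, rng):
    """Run every candidate; return the lowest-loss result and all results."""
    results = []
    for name, algorithm in algorithms.items():
        centers = algorithm(points, k, rng)
        results.append((name, centers, loss(points, centers)))
    return min(results, key=lambda r: r[2]), results


rng = random.Random(1)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
points += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(15)]

algorithms = {"random": random_centers, "farthest_point": farthest_point_centers}
best, results = choose_best(points, 2, algorithms, rng)
print("selected:", best[0], "loss:", round(best[2], 2))
```

By construction the selected result is never worse, on the chosen loss, than any single candidate, which matches the abstract's claim in spirit.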
