Scalable Density-based Clustering with Random Projections

📅 2024-02-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Traditional density-based clustering algorithms (e.g., DBSCAN, OPTICS) incur prohibitive computational and memory costs on high-dimensional, large-scale datasets. Method: This paper proposes sDBSCAN and sOPTICS, scalable variants that exploit the neighborhood-preserving property of random projections to quickly identify core points and their neighborhoods under cosine distance, the primary bottleneck of density-based clustering, and that extend to other distance metrics via random kernel features. Contribution/Results: Under mild conditions, sDBSCAN outputs a clustering structure similar to DBSCAN's with high probability, and sOPTICS supports interactive exploration of the intrinsic hierarchical structure. On million-point real-world datasets, both algorithms finish in minutes, more than an order of magnitude faster than scikit-learn's implementations (which take hours or fail due to memory constraints), while achieving higher clustering accuracy than many competing methods.

📝 Abstract
We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance. Utilizing the neighborhood-preserving property of random projections, sDBSCAN can quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN outputs a clustering structure similar to DBSCAN under mild conditions with high probability. To further facilitate sDBSCAN, we present sOPTICS, a scalable OPTICS for interactive exploration of the intrinsic clustering structure. We also extend sDBSCAN and sOPTICS to L2, L1, $\chi^2$, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and provides higher accuracy than many other clustering algorithms on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while scikit-learn's counterparts demand several hours or cannot run due to memory constraints.
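The neighborhood-preserving property of Gaussian random projections under cosine distance can be illustrated with a small sketch (this is an illustration of the underlying principle, not the paper's implementation): for a random Gaussian vector r, the signs of r·x and r·y agree with probability 1 − θ/π, where θ is the angle between x and y, so the sign-agreement rate over many random projections estimates the cosine similarity between two points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two high-dimensional points whose cosine similarity we want to estimate.
d = 256
x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)  # a nearby point

# Draw D random Gaussian projection vectors.
D = 2000
R = rng.standard_normal((D, d))

# For each random vector r, sign(r.x) == sign(r.y) with probability
# 1 - theta/pi, where theta is the angle between x and y, so the
# agreement rate yields an estimate of the angle and hence the cosine.
agree = np.mean(np.sign(R @ x) == np.sign(R @ y))
theta_est = np.pi * (1.0 - agree)
cos_est = np.cos(theta_est)

cos_true = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
```

This is why close points in cosine distance tend to land together under random projections, which is the property sDBSCAN leverages to find core points and their neighborhoods quickly.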
Problem

Research questions and friction points this paper is trying to address.

Scalable density-based clustering in high dimensions
Efficient core point identification with random projections
Extension to multiple distance metrics for clustering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses random projections for scalable density-based clustering
Extends to multiple distances via random kernel features
Significantly faster and more accurate than traditional methods
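The extension beyond cosine distance rests on random kernel features, which map points into a space where kernel values become inner products, so the projection-based machinery still applies. A minimal sketch using random Fourier features for the Gaussian (L2-based) kernel follows; the bandwidth `sigma` and dimensions here are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random Fourier features approximate the Gaussian kernel
# k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), turning an L2-based
# similarity into an inner product that projection-based methods handle.
d, D, sigma = 32, 4096, 1.0
W = rng.standard_normal((D, d)) / sigma       # random frequencies
b = rng.uniform(0.0, 2.0 * np.pi, D)          # random phases

def phi(v):
    # Feature map with phi(x) . phi(y) ~ k(x, y).
    return np.sqrt(2.0 / D) * np.cos(W @ v + b)

x = rng.standard_normal(d)
y = x + 0.1 * rng.standard_normal(d)

k_true = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
k_approx = phi(x) @ phi(y)
```

The paper uses analogous random feature constructions for L1, $\chi^2$, and Jensen-Shannon distances; the Gaussian case above is simply the most familiar instance of the idea.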