Scalable Density-based Clustering with Random Projections

📅 2024-02-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Traditional density-based clustering algorithms (e.g., DBSCAN, OPTICS) incur prohibitive computational and memory costs on high-dimensional, large-scale datasets. Method: This paper proposes sDBSCAN and sOPTICS, scalable variants that exploit the neighborhood-preserving property of random projections to quickly identify core points and their neighborhoods under cosine distance, the primary bottleneck of density-based clustering, and that extend to other distance metrics via random kernel features. Contribution/Results: Under mild conditions, sDBSCAN outputs a clustering structure similar to DBSCAN's with high probability, and sOPTICS supports interactive exploration of the intrinsic hierarchical structure. On million-point real-world datasets, both algorithms finish in minutes, more than an order of magnitude faster than scikit-learn's implementations (which take hours or fail due to memory constraints), while achieving higher clustering accuracy than many competing methods.

📝 Abstract
We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance. Utilizing the neighborhood-preserving property of random projections, sDBSCAN can quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN outputs a clustering structure similar to DBSCAN under mild conditions with high probability. To further facilitate sDBSCAN, we present sOPTICS, a scalable OPTICS for interactive exploration of the intrinsic clustering structure. We also extend sDBSCAN and sOPTICS to L2, L1, $\chi^2$, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and provides higher accuracy than many other clustering algorithms on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while scikit-learn's counterparts demand several hours or cannot run due to memory constraints.
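The neighborhood-preserving property of Gaussian random projections under cosine distance can be illustrated with a small sketch (this is an illustration of the underlying principle, not the paper's implementation): for a random Gaussian vector r, the signs of r·x and r·y agree with probability 1 − θ/π, where θ is the angle between x and y, so the sign-agreement rate over many random projections estimates the cosine similarity between two points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two high-dimensional points whose cosine similarity we want to estimate.
d = 256
x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)  # a nearby point

# Draw D random Gaussian projection vectors.
D = 2000
R = rng.standard_normal((D, d))

# For each random vector r, sign(r.x) == sign(r.y) with probability
# 1 - theta/pi, where theta is the angle between x and y, so the
# agreement rate yields an estimate of the angle and hence the cosine.
agree = np.mean(np.sign(R @ x) == np.sign(R @ y))
theta_est = np.pi * (1.0 - agree)
cos_est = np.cos(theta_est)

cos_true = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
```

This is why close points in cosine distance tend to land together under random projections, which is the property sDBSCAN leverages to find core points and their neighborhoods quickly.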
Problem

Research questions and friction points this paper is trying to address.

Scalable density-based clustering in high dimensions
Efficient core point identification with random projections
Extension to multiple distance metrics for clustering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses random projections for scalable density-based clustering
Extends to multiple distances via random kernel features
Significantly faster and more accurate than traditional methods
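The extension beyond cosine distance rests on random kernel features, which map points into a space where kernel values become inner products, so the projection-based machinery still applies. A minimal sketch using random Fourier features for the Gaussian (L2-based) kernel follows; the bandwidth `sigma` and dimensions here are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random Fourier features approximate the Gaussian kernel
# k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), turning an L2-based
# similarity into an inner product that projection-based methods handle.
d, D, sigma = 32, 4096, 1.0
W = rng.standard_normal((D, d)) / sigma       # random frequencies
b = rng.uniform(0.0, 2.0 * np.pi, D)          # random phases

def phi(v):
    # Feature map with phi(x) . phi(y) ~ k(x, y).
    return np.sqrt(2.0 / D) * np.cos(W @ v + b)

x = rng.standard_normal(d)
y = x + 0.1 * rng.standard_normal(d)

k_true = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
k_approx = phi(x) @ phi(y)
```

The paper uses analogous random feature constructions for L1, $\chi^2$, and Jensen-Shannon distances; the Gaussian case above is simply the most familiar instance of the idea.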