🤖 AI Summary
Problem: Traditional density-based clustering algorithms (e.g., DBSCAN, OPTICS) incur prohibitive computational and memory costs on high-dimensional, large-scale datasets.
Method: This paper proposes sDBSCAN and sOPTICS, scalable variants that use theoretically grounded random projections to preserve neighborhood structure under cosine and other distance metrics. It further introduces neighborhood-aware pruning and indexing optimizations to efficiently approximate core points and their neighborhoods within modified DBSCAN/OPTICS frameworks.
Contribution/Results: Under mild conditions, sDBSCAN provably outputs a clustering structure similar to DBSCAN's with high probability, and sOPTICS supports interactive exploration of the clustering structure. On real-world million-point datasets, both algorithms finish in a few minutes, whereas scikit-learn's implementations take several hours or fail due to memory constraints. sDBSCAN also achieves higher clustering accuracy than many competing clustering algorithms.
📝 Abstract
We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance. Utilizing the neighborhood-preserving property of random projections, sDBSCAN can quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN outputs a clustering structure similar to DBSCAN under mild conditions with high probability. To facilitate the use of sDBSCAN, we present sOPTICS, a scalable OPTICS for interactive exploration of the intrinsic clustering structure. We also extend sDBSCAN and sOPTICS to L2, L1, $\chi^2$, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and provides higher accuracy than many other clustering algorithms on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while their scikit-learn counterparts demand several hours or cannot run due to memory constraints.
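
To make the core-point idea concrete, here is a minimal NumPy sketch of how random projections could shortlist neighbor candidates under cosine distance. It is a simplified illustration and not the authors' implementation: the function name `approx_core_points` and the parameters `n_proj`, `top_dirs`, and `top_pts` are hypothetical, and the rows of `X` are assumed to be unit-normalized. Each point looks only at the points that project highest onto the random directions it aligns with best, and then confirms candidates with an exact distance check.

```python
import numpy as np

def approx_core_points(X, eps, min_pts, n_proj=256, top_dirs=5, top_pts=50, seed=0):
    """Approximate DBSCAN core points under cosine distance (illustrative sketch).

    Assumes every row of X has unit L2 norm, so cosine distance = 1 - dot product.
    All parameter names are hypothetical and chosen for this example only.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    R = rng.standard_normal((d, n_proj))   # random Gaussian directions
    P = X @ R                              # projections, shape (n, n_proj)

    top_m = min(top_pts, n)
    top_k = min(top_dirs, n_proj)
    # For each direction: indices of the top_m points with the largest projections.
    pts_per_dir = np.argpartition(-P, top_m - 1, axis=0)[:top_m].T   # (n_proj, top_m)
    # For each point: indices of the top_k directions it projects onto most strongly.
    dirs_per_pt = np.argpartition(-P, top_k - 1, axis=1)[:, :top_k]  # (n, top_k)

    neighbors = []
    for i in range(n):
        # Candidates: points ranked highly on the directions closest to point i,
        # plus i itself (DBSCAN counts a point in its own neighborhood).
        cand = np.unique(np.append(pts_per_dir[dirs_per_pt[i]].ravel(), i))
        dist = 1.0 - X[cand] @ X[i]        # exact cosine distance on candidates only
        neighbors.append(cand[dist <= eps])

    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    return is_core, neighbors
```

Because every candidate is verified with an exact distance, the returned neighborhoods are subsets of the true eps-neighborhoods: the sketch can miss neighbors but never adds false ones. Raising `n_proj`, `top_dirs`, or `top_pts` trades speed for recall, and setting `top_pts` to the number of points recovers the exact neighborhoods.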