A K-Means, Ward and DBSCAN repeatability study

📅 2025-12-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work systematically investigates the lack of bit-level reproducibility in the K-Means, Ward, and DBSCAN clustering algorithms under multithreaded execution. To isolate sources of non-determinism, we decompose each algorithm into its computational stages and analyze the scikit-learn implementations alongside OpenMP’s numerical determinism guarantees. We identify, for the first time, that K-Means fails to reproduce bit-identical results when using more than two OpenMP threads, because of non-deterministic reduction ordering in parallel distance computations. In contrast, Ward and DBSCAN exhibit higher reproducibility under default configurations. Based on these findings, we propose a stage-wise reproducibility assessment framework for clustering algorithms and derive practical guidelines for achieving reproducible clustering in scientific work. The study raises awareness of reproducibility risks in foundational machine learning algorithms and provides both theoretical insight and engineering evidence to support robust algorithm design.
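The role of reduction ordering can be illustrated with a small, generic sketch. It is not taken from scikit-learn or from the paper; the helper `sum_in_order` is a hypothetical name introduced here. It only shows that floating-point addition is not associative, so accumulating the same partial results in a different order (as can happen when threads finish in a different order) may change the last bits of the total.

```python
def sum_in_order(values, order):
    """Accumulate values[i] for i in `order`, left to right, in double precision."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [0.1, 0.2, 0.3]

# Same three numbers, two accumulation orders.
forward = sum_in_order(values, [0, 1, 2])   # (0.1 + 0.2) + 0.3
backward = sum_in_order(values, [2, 1, 0])  # (0.3 + 0.2) + 0.1

print(forward == backward)       # False: the totals differ in the last bit
print(abs(forward - backward))   # a difference on the order of 1e-16
```

A difference this small rarely changes a scientific conclusion, but it is enough to break a bitwise comparison between two runs.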

📝 Abstract

Reproducibility is essential in machine learning because it ensures that a model or experiment yields the same scientific conclusion. For some algorithms, repeatability with bitwise identical results is also key to scientific integrity because it makes debugging possible. We decompose three widely used clustering algorithms, K-Means, DBSCAN, and Ward, into their fundamental steps and identify the conditions required to achieve repeatability at each stage. We use an example implementation based on the Python library scikit-learn to examine the repeatable aspects of each method. Our results reveal inconsistent K-Means output when the number of OpenMP threads exceeds two. This work aims to raise awareness of this issue among both users and developers, encouraging further investigation and potential fixes.
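A check of this kind might look like the following minimal sketch. It is not the authors' experimental code: the function name `kmeans_is_repeatable`, the synthetic data, and the choice of thread counts are assumptions made for illustration. It relies on `threadpoolctl.threadpool_limits` to cap the OpenMP thread pool and compares the raw bytes of the fitted centers so that the comparison is bitwise rather than approximate.

```python
import numpy as np
from sklearn.cluster import KMeans
from threadpoolctl import threadpool_limits


def kmeans_is_repeatable(X, n_threads, n_runs=5, n_clusters=4, seed=0):
    """Return True if n_runs identical fits yield bit-identical cluster centers."""
    reference = None
    for _ in range(n_runs):
        # Cap the OpenMP thread pool for this fit only.
        with threadpool_limits(limits=n_threads, user_api="openmp"):
            model = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed).fit(X)
        centers = model.cluster_centers_.tobytes()  # raw bytes give a bitwise comparison
        if reference is None:
            reference = centers
        elif centers != reference:
            return False
    return True


# Synthetic data (an assumption; the paper's datasets are not given here).
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 8))

report = {n: kmeans_is_repeatable(X, n) for n in (1, 2, 4)}
print(report)
```

The outcome depends on the machine, the scikit-learn build, and the number of available cores, so a `True` result on one system does not rule out the behavior described above on another.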

Problem

Research questions and friction points this paper is trying to address.

Ensuring that clustering algorithms produce consistent results across runs

Identifying the conditions for repeatability in K-Means, DBSCAN, and Ward

Highlighting inconsistent K-Means outcomes when more than two OpenMP threads are used

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposed clustering algorithms into fundamental steps

Identified conditions for repeatability at each stage

Used scikit-learn to examine repeatable aspects
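For the Ward and DBSCAN parts of the study, a repeatability check on the final labels could be sketched as below. This is an illustration under stated assumptions, not the paper's procedure: `labels_are_repeatable` is a hypothetical helper, and the data and hyperparameters (`n_clusters=3`, `eps=0.3`, `min_samples=5`) are arbitrary. Ward linkage is accessed through scikit-learn's `AgglomerativeClustering` with `linkage="ward"`.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering


def labels_are_repeatable(make_model, X, n_runs=3):
    """Fit a fresh model n_runs times and check that the labels never change."""
    reference = make_model().fit(X).labels_
    return all(
        np.array_equal(reference, make_model().fit(X).labels_)
        for _ in range(n_runs - 1)
    )


# Synthetic data (an assumption made for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

results = {
    "ward": labels_are_repeatable(
        lambda: AgglomerativeClustering(n_clusters=3, linkage="ward"), X
    ),
    "dbscan": labels_are_repeatable(lambda: DBSCAN(eps=0.3, min_samples=5), X),
}
print(results)
```

Labels are integers, so `np.array_equal` is an exact comparison here. A stage-wise study would apply the same idea to intermediate outputs as well, not only to the final labels.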
