A K-Means, Ward and DBSCAN repeatability study

📅 2025-12-22
🤖 AI Summary
This work investigates the lack of bit-level reproducibility in the K-Means, Ward, and DBSCAN clustering algorithms under multithreaded execution. To isolate sources of non-determinism, the authors decompose each algorithm into its computational stages and analyze the corresponding scikit-learn implementations alongside the numerical determinism guarantees of OpenMP. They report that K-Means fails to produce bit-identical results when more than two OpenMP threads are used, which the summary attributes to non-deterministic reduction ordering in parallel distance computations. In contrast, Ward and DBSCAN exhibit higher reproducibility under default configurations. Based on these findings, the work proposes a stage-wise reproducibility assessment for clustering algorithms and derives practical guidelines for achieving repeatable clustering. The study aims to raise awareness of reproducibility risks in foundational machine learning algorithms among both users and developers.
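The sensitivity to reduction ordering mentioned above comes from the fact that floating-point addition is not associative. The following minimal sketch is not taken from the paper: the helper `accumulate` and the two-way split are hypothetical, chosen only to show how the same three numbers can sum to different bit patterns when partial sums are grouped differently, as can happen when a reduction is split across threads.

```python
def accumulate(values):
    """Plain left-to-right floating-point accumulation.

    A manual loop is used on purpose: the built-in sum() may apply
    compensated summation in recent Python versions, which would hide
    the effect this sketch is meant to show.
    """
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]

# One worker: (0.1 + 0.2) + 0.3
one_thread = accumulate(values)

# Two hypothetical workers: the first sums [0.1], the second sums
# [0.2, 0.3], and the partial sums are then combined: 0.1 + (0.2 + 0.3)
partials = [accumulate(values[:1]), accumulate(values[1:])]
two_threads = accumulate(partials)

# float.hex() exposes the exact bit pattern of each result.
print(one_thread.hex(), two_threads.hex())
print("bitwise identical:", one_thread == two_threads)
```

The two results agree to about sixteen significant digits but are not bitwise identical, which is exactly the kind of difference a repeatability study looks for.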

📝 Abstract
Reproducibility is essential in machine learning because it ensures that a model or experiment yields the same scientific conclusion. For specific algorithms, repeatability with bitwise identical results is also key to scientific integrity because it allows debugging. We decompose several very popular clustering algorithms (K-Means, DBSCAN and Ward) into their fundamental steps and identify the conditions required to achieve repeatability at each stage. We use an implementation example with the Python library scikit-learn to examine the repeatable aspects of each method. Our experiments reveal inconsistent results with K-Means when the number of OpenMP threads exceeds two. This work aims to raise awareness of this issue among both users and developers, encouraging further investigation and potential fixes.
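A repeatability check of the kind described in the abstract could be sketched as follows. This is not the authors' protocol: the dataset, the number of runs, the thread counts, and the helper name `kmeans_is_repeatable` are all assumptions made for illustration. The sketch fits scikit-learn's `KMeans` several times with a fixed `random_state` while `threadpoolctl.threadpool_limits` caps the number of OpenMP threads, and compares the raw bytes of the fitted centers and labels across runs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from threadpoolctl import threadpool_limits


def kmeans_is_repeatable(X, n_threads, n_runs=5, seed=0):
    """Return True if n_runs fits with the same seed are bitwise identical.

    Hypothetical helper: the comparison uses the raw bytes of the
    cluster centers and labels, so any difference in the last bit of
    any coordinate counts as a failure.
    """
    reference = None
    # Cap the OpenMP thread pool for every fit inside this block.
    with threadpool_limits(limits=n_threads, user_api="openmp"):
        for _ in range(n_runs):
            model = KMeans(n_clusters=5, n_init=10, random_state=seed).fit(X)
            fingerprint = (
                model.cluster_centers_.tobytes() + model.labels_.tobytes()
            )
            if reference is None:
                reference = fingerprint
            elif fingerprint != reference:
                return False
    return True


# Synthetic data, assumed for illustration only.
X, _ = make_blobs(n_samples=2000, n_features=8, centers=5, random_state=0)

results = {n: kmeans_is_repeatable(X, n) for n in (1, 2, 4)}
for n_threads, repeatable in results.items():
    print(f"{n_threads} OpenMP thread(s): repeatable = {repeatable}")
```

Whether a discrepancy actually appears depends on the machine, the scikit-learn build, and the data, so a run that reports `True` everywhere does not contradict the paper's finding. Setting the `OMP_NUM_THREADS` environment variable before starting Python is another common way to control the thread count.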
Problem

Research questions and friction points this paper is trying to address.

Ensuring that clustering algorithms produce consistent results across runs
Identifying the conditions for repeatability in K-Means, DBSCAN and Ward
Highlighting inconsistent K-Means results when more than two OpenMP threads are used
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposed the clustering algorithms into their fundamental steps
Identified the conditions required for repeatability at each stage
Examined the repeatable aspects of each method using scikit-learn
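The idea of checking repeatability stage by stage can be illustrated with a small NumPy sketch. It does not reproduce the paper's decomposition or scikit-learn's implementation: the three stages (initialization, assignment, update), the random-subset initialization, and the helpers `fingerprint`, `lloyd_stages`, and `first_divergence` are assumptions made for illustration. Each stage output is hashed so that two runs can be compared and the first diverging stage located.

```python
import hashlib

import numpy as np


def fingerprint(array):
    """Hash the raw bytes of an array so runs can be compared bit for bit."""
    return hashlib.sha256(np.ascontiguousarray(array).tobytes()).hexdigest()


def lloyd_stages(X, n_clusters, seed, n_iter=10):
    """Run a basic Lloyd-style K-Means and record a hash after every stage."""
    rng = np.random.default_rng(seed)
    # Stage 1: initialization (a seeded random subset of the points).
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    trace = [("init", fingerprint(centers))]
    for it in range(n_iter):
        # Stage 2: assign each point to its nearest center.
        distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        trace.append((f"assign[{it}]", fingerprint(labels)))
        # Stage 3: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = np.stack([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        trace.append((f"update[{it}]", fingerprint(centers)))
    return trace


def first_divergence(trace_a, trace_b):
    """Return the name of the first stage whose hashes differ, or None."""
    for (stage, hash_a), (_, hash_b) in zip(trace_a, trace_b):
        if hash_a != hash_b:
            return stage
    return None


X = np.random.default_rng(42).normal(size=(300, 4))
trace_a = lloyd_stages(X, n_clusters=3, seed=0)
trace_b = lloyd_stages(X, n_clusters=3, seed=0)
print("first diverging stage:", first_divergence(trace_a, trace_b))
```

In this single-process sketch both traces are expected to match. The same tracing approach becomes informative when the two runs differ in thread count, library build, or hardware, because the first mismatching hash points to the stage where non-determinism enters.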
Anthony Bertrand
Université Clermont Auvergne, Clermont Auvergne INP, ENSM St Etienne, CNRS, LIMOS, F-63000 Clermont-Ferrand, France
Engelbert Mephu Nguifo
Professor of Computer Science, University Clermont Auvergne
artificial intelligence, formal concept analysis, data mining, machine learning, bioinformatics
Violaine Antoine
UCA LIMOS
David R.C. Hill
Université Clermont Auvergne, Clermont Auvergne INP, ENSM St Etienne, CNRS, LIMOS, F-63000 Clermont-Ferrand, France