🤖 AI Summary
Existing clustering comparison methods struggle to assess agreement between ground-truth labels and clustering results that contain overlapping clusters and outliers, often yielding misleading evaluations due to structural biases. This work presents a systematic approach to measuring clustering similarity tailored for such complex scenarios, integrating set-matching principles with information-theoretic concepts. The proposed measure is rigorously defined and satisfies several desirable theoretical properties. Theoretical analysis and extensive experiments demonstrate its robustness and fairness, showing that it mitigates the evaluation bias that afflicts conventional metrics when confronted with overlapping structures and outliers. The method thus provides a reliable tool for the quantitative comparison of complex clustering outcomes.
📝 Abstract
Clustering algorithms are an essential part of the unsupervised data science ecosystem, and extrinsic evaluation of clustering algorithms requires a method for comparing the detected clustering to a ground-truth clustering. In a general setting, the detected and ground-truth clusterings may have outliers (objects belonging to no cluster), overlapping clusters (objects belonging to more than one cluster), or both, but methods for comparing such clusterings are currently undeveloped. In this note, we define a pragmatic similarity measure for comparing clusterings with overlaps and outliers, show that it has several desirable properties, and experimentally confirm that it is not subject to several common biases afflicting other clustering comparison measures.
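To make the setting concrete: a clustering with overlaps and outliers can be represented as a list of sets of object ids, where an object in no set is an outlier and an object in several sets lies in overlapping clusters. The abstract does not spell out the proposed measure, so the sketch below is only a hypothetical illustration of the set-matching idea it builds on, using a simple symmetric best-match Jaccard baseline (not the paper's measure).

```python
# Hypothetical sketch of the comparison setting, assuming clusters are
# represented as sets of object ids. Objects absent from every set are
# outliers; objects in multiple sets belong to overlapping clusters.
# The measure below is a simple set-matching baseline, NOT the measure
# proposed in the paper.

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match_similarity(detected, truth):
    """Average each cluster's best Jaccard match in the other clustering,
    in both directions, then take the mean. Symmetric, in [0, 1]."""
    def one_way(src, dst):
        if not src:
            return 0.0
        return sum(max((jaccard(c, d) for d in dst), default=0.0)
                   for c in src) / len(src)
    return 0.5 * (one_way(detected, truth) + one_way(truth, detected))

# Detected clustering: clusters overlap on object 3; object 9 is an outlier
# (it appears in neither clustering).
detected = [{1, 2, 3}, {3, 4, 5}]
truth = [{1, 2, 3}, {4, 5}]
score = best_match_similarity(detected, truth)  # 5/6 ≈ 0.833
```

Note that a naive baseline like this still rewards near-matches of overlapping clusters without any special handling of outliers, which is exactly the kind of structural blind spot the paper's measure is designed to correct.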