🤖 AI Summary
Existing divisive hierarchical clustering methods often produce dendrograms that suffer from unwarranted splits, failure to group similar clusters together, and inconsistency with ground-truth labels. This work reframes the problem from a distributional perspective and proposes replacing the traditional set-oriented bisecting criterion with a distributional kernel. By optimizing the total similarity of all clusters (TSC), the method constructs dendrograms with a guaranteed lower bound on TSC, thereby overcoming structural limitations inherent in conventional approaches. Experiments on both synthetic and spatial transcriptomics data show that the resulting dendrograms are consistent with biologically meaningful regions, whereas those generated by existing methods are not.
📝 Abstract
We find that current objective-based Divisive Hierarchical Clustering (DHC) methods produce dendrograms that lack three desired properties: no unwarranted splitting, grouping of similar clusters into the same subset, and ground-truth correspondence. This shortcoming has its root cause in the use of a set-oriented bisecting assessment criterion. We show that it can be addressed by using a distributional kernel instead of the set-oriented criterion, and that the resultant clusters achieve a new distribution-oriented objective: maximizing the total similarity of all clusters (TSC). Our theoretical analysis shows that the resultant dendrogram guarantees a lower bound on TSC. An empirical evaluation demonstrates the effectiveness of the proposed method on artificial and Spatial Transcriptomics (bioinformatics) datasets. In particular, the proposed method creates a dendrogram that is consistent with the biological regions in a Spatial Transcriptomics dataset, whereas other contenders fail to do so.
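To make the idea of a distributional kernel concrete, here is a minimal sketch of one common construction: the similarity between two point sets is the average of a point-to-point kernel over all cross pairs (a kernel mean embedding). The abstract does not say which kernel the paper uses or how TSC is defined, so the Gaussian point kernel, the `gamma` parameter, and the function names below are all assumptions for illustration only, not the authors' method.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """Pairwise Gaussian kernel matrix between rows of X and rows of Y.

    Assumption: a Gaussian point kernel is used purely for illustration.
    """
    sq_dist = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def distributional_kernel(X, Y, gamma=1.0):
    """Similarity between two samples treated as distributions.

    Computed as the mean of the point kernel over all cross pairs,
    i.e. the inner product of the two empirical kernel mean embeddings.
    """
    return float(gaussian_kernel(X, Y, gamma).mean())

# Hypothetical toy data: a and b come from the same distribution,
# c comes from a distribution centred far away.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.3, size=(50, 2))
b = rng.normal(0.0, 0.3, size=(50, 2))
c = rng.normal(5.0, 0.3, size=(50, 2))

sim_ab = distributional_kernel(a, b)
sim_ac = distributional_kernel(a, c)
print(f"similarity(a, b) = {sim_ab:.3f}")
print(f"similarity(a, c) = {sim_ac:.3e}")
```

In this toy setting the two samples drawn from the same distribution score a much higher similarity than the two drawn from well-separated ones. A distribution-oriented criterion can use this kind of score to compare clusters as distributions rather than as sets of points.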