🤖 AI Summary
This work addresses the limitation of traditional spectral clustering, which is largely confined to single-modality data and unable to exploit rich multimodal semantic information. To bridge this gap, the study introduces pretrained vision-language models into spectral clustering for the first time, proposing a noun-anchored neural tangent kernel approach to construct a semantics-aware affinity matrix. Furthermore, it designs a prompt-guided adaptive diffusion mechanism to enhance cross-modal alignment. Evaluated across 16 benchmark datasets—spanning classical, large-scale, fine-grained, and domain-shift scenarios—the method consistently outperforms current state-of-the-art approaches. This advancement marks a significant step in evolving spectral clustering from a unimodal to a multimodal paradigm, demonstrating superior generalization and robustness.
📝 Abstract
Spectral clustering is a powerful technique for unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper extends spectral clustering from a single-modal to a multi-modal regime. Specifically, we propose Neural Tangent Kernel Spectral Clustering, which leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we formulate the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles the affinity matrices induced by different prompts. Extensive experiments on **16** benchmarks, including classical, large-scale, fine-grained, and domain-shifted datasets, demonstrate that our method consistently outperforms the state-of-the-art by a large margin.
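The core idea, an affinity that couples visual proximity with semantic overlap over "positive" nouns, can be illustrated with a toy sketch. This is not the paper's implementation: the real method anchors a neural tangent kernel of a pre-trained vision-language model, whereas the sketch below uses plain cosine similarities, a top-k noun selection, and a minimal spectral clustering routine, all of which are illustrative assumptions.

```python
import numpy as np

def coupled_affinity(img_emb, noun_emb, top_k=2):
    """Toy noun-anchored affinity: visual proximity x semantic overlap.

    img_emb:  (n_images, d) image embeddings (e.g., from a VLM image encoder)
    noun_emb: (n_nouns, d)  noun/text embeddings
    """
    I = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    N = noun_emb / np.linalg.norm(noun_emb, axis=1, keepdims=True)
    visual = I @ I.T                       # cosine visual proximity
    sem = I @ N.T                          # image-to-noun similarity
    # Mark each image's top-k "positive" nouns (illustrative stand-in
    # for the paper's positive-noun anchoring).
    mask = np.zeros_like(sem)
    idx = np.argsort(-sem, axis=1)[:, :top_k]
    np.put_along_axis(mask, idx, 1.0, axis=1)
    overlap = (mask @ mask.T) / top_k      # fraction of shared positive nouns
    # Coupling: a pair must be both visually close AND share nouns,
    # which zeroes out spurious cross-cluster ties.
    return visual * overlap

def spectral_clusters(A, k, iters=50):
    """Minimal spectral clustering on affinity A via the normalized Laplacian."""
    d = A.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(A)) - d_is[:, None] * A * d_is[None, :]
    _, vecs = np.linalg.eigh(L)            # eigenvalues in ascending order
    U = vecs[:, :k]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Farthest-point init, then plain k-means on the rows of U.
    C = [U[0]]
    for _ in range(1, k):
        d2 = np.min([((U - c) ** 2).sum(1) for c in C], axis=0)
        C.append(U[np.argmax(d2)])
    C = np.array(C)
    for _ in range(iters):
        lab = np.argmin(((U[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        C = np.array([U[lab == j].mean(0) if (lab == j).any() else C[j]
                      for j in range(k)])
    return lab
```

On well-separated data, the coupling term makes the affinity near block-diagonal (cross-cluster pairs share no positive noun, so their affinity vanishes), which is the structure the spectral step exploits.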