Delving into Spectral Clustering with Vision-Language Representations

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of most spectral clustering methods: they operate on a single modality and leave the semantic information in multimodal representations unused. To bridge this gap, the study brings pretrained vision-language models into spectral clustering and proposes Neural Tangent Kernel Spectral Clustering, which anchors the neural tangent kernel with positive nouns (nouns semantically close to the images of interest) so that the affinity between two images couples their visual proximity with their semantic overlap. It also introduces a regularized affinity diffusion mechanism that adaptively ensembles the affinity matrices induced by different prompts. Evaluated on 16 benchmark datasets spanning classical, large-scale, fine-grained, and domain-shifted scenarios, the method consistently outperforms state-of-the-art approaches by a large margin, moving spectral clustering from a single-modal to a multi-modal regime.

📝 Abstract
Spectral clustering is known as a powerful technique in unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of spectral clustering from a single-modal to a multi-modal regime. Particularly, we propose Neural Tangent Kernel Spectral Clustering that leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we arrive at formulating the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts. Extensive experiments on 16 benchmarks (including classical, large-scale, fine-grained and domain-shifted datasets) manifest that our method consistently outperforms the state-of-the-art by a large margin.
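
The coupling described in the abstract can be illustrated with a minimal sketch. This is not the paper's neural tangent kernel formulation, which the abstract does not spell out. It assumes image and noun embeddings are already available as NumPy arrays (for example from a pretrained vision-language model), takes visual proximity to be clipped cosine similarity, takes semantic overlap to be the inner product of softmax profiles over the anchor nouns, and replaces the regularized affinity diffusion with a plain uniform average over prompt-specific affinities. All function names, the `temperature` parameter, and the simple k-means routine are hypothetical choices made for this illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)


def coupled_affinity(image_emb, noun_emb, temperature=0.05):
    """Illustrative affinity: visual proximity times semantic overlap.

    Assumption: this product form is a stand-in, not the paper's kernel.
    """
    img = l2_normalize(image_emb)
    noun = l2_normalize(noun_emb)
    # Visual proximity: non-negative cosine similarity between images.
    visual = np.clip(img @ img.T, 0.0, None)
    # Semantic profile: softmax over image-to-noun similarities.
    logits = img @ noun.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    profile = np.exp(logits)
    profile = profile / profile.sum(axis=1, keepdims=True)
    # Semantic overlap: agreement of two images on their anchor nouns.
    semantic = profile @ profile.T
    affinity = visual * semantic
    np.fill_diagonal(affinity, 0.0)
    return affinity


def average_affinities(affinities):
    """Uniform average of prompt-specific affinities.

    Assumption: a simple stand-in for the regularized affinity diffusion.
    """
    return np.mean(np.stack(affinities), axis=0)


def spectral_embedding(affinity, k):
    """Row-normalized bottom-k eigenvectors of the normalized Laplacian."""
    degree = affinity.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    normalized = inv_sqrt[:, None] * affinity * inv_sqrt[None, :]
    laplacian = np.eye(len(affinity)) - normalized
    _, vectors = np.linalg.eigh(laplacian)  # eigenvalues come back ascending
    return l2_normalize(vectors[:, :k])


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels
```

Under these assumptions, a run would compute one affinity per prompt with `coupled_affinity`, combine them with `average_affinities`, and pass the result through `spectral_embedding` and `kmeans`. Because the semantic term is close to zero for images whose nearest nouns differ, cross-cluster entries shrink and the combined matrix moves toward the block-diagonal structure the abstract refers to.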
Problem

Research questions and friction points this paper is trying to address.

Spectral Clustering
Multi-Modal Representation
Vision-Language Models
Cross-Modal Alignment
Affinity Matrix
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectral Clustering
Vision-Language Representation
Neural Tangent Kernel
Cross-Modal Alignment
Affinity Diffusion
Authors
Bo Peng
Australian Artificial Intelligence Institute, University of Technology Sydney, Australia
Yuanwei Hu
Australian Artificial Intelligence Institute, University of Technology Sydney, Australia
Bo Liu
University of Technology Sydney
Cyber security and privacy; AI; wireless communications and networks; broadcasting
Ling Chen
University of Technology Sydney
Data mining; machine learning
Jie Lu
Australian Artificial Intelligence Institute, University of Technology Sydney, Australia
Zhen Fang
AAII, University of Technology Sydney
Machine learning