🤖 AI Summary
Existing image clustering methods rely heavily on multimodal components, such as LLMs or other text encoders and paired text-image data, as well as complex training pipelines, hindering deployment in real-world scenarios that lack textual annotations.
Method: We propose a lightweight, unimodal clustering paradigm: only a small linear/MLP cluster head is trained on top of frozen features from a pre-trained ViT or ResNet, and positive sample pairs are used in a contrastive-style objective to optimize cluster assignments end-to-end without supervision.
Contribution/Results: We provide a theoretical result explaining why, at least under ideal conditions, text embeddings may not be necessary, eliminating the multimodal dependency. Our method achieves performance competitive with state-of-the-art approaches on the CIFAR-10/20/100, STL-10, ImageNet-10, and ImageNet-Dogs benchmarks while reducing training overhead by over 70%. This significantly enhances the practicality and scalability of image clustering in real-world settings.
📝 Abstract
Many competitive clustering pipelines have a multi-modal design, leveraging large language models (LLMs) or other text encoders together with text-image pairs, which are often unavailable in real-world downstream applications. Additionally, such frameworks are generally complicated to train and require substantial computational resources, making widespread adoption challenging. In this work, we show that in deep clustering, competitive performance with more complex state-of-the-art methods can be achieved using a text-free and highly simplified training pipeline. In particular, our approach, Simple Clustering via Pre-trained models (SCP), trains only a small cluster head while leveraging pre-trained vision model feature representations and positive data pairs. Experiments on benchmark datasets including CIFAR-10, CIFAR-20, CIFAR-100, STL-10, ImageNet-10, and ImageNet-Dogs demonstrate that SCP achieves highly competitive performance. Furthermore, we provide a theoretical result explaining why, at least under ideal conditions, additional text-based embeddings may not be necessary to achieve strong clustering performance in vision.
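To make the pipeline concrete, here is a minimal NumPy sketch of the kind of training setup the abstract describes: frozen backbone features, a small trainable cluster head, and a positive-pair consistency objective. All specifics here are illustrative assumptions, not the paper's actual implementation: random vectors stand in for frozen ViT/ResNet features, the "positive pair" is a noisy copy of each feature, and the loss (cross-entropy between the cluster distributions of the two views) is one plausible choice of contrastive-style objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    # Numerically stable softmax over cluster logits.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: batch size, feature dim, number of clusters.
n, d, k = 8, 32, 4

# Stand-ins for frozen pre-trained features of two views of each image;
# in SCP these would come from a frozen ViT/ResNet backbone.
feats = rng.normal(size=(n, d))
feats_pos = feats + 0.05 * rng.normal(size=(n, d))  # positive pair (augmented view)

# The only trainable component: a small linear cluster head.
W = 0.01 * rng.normal(size=(d, k))
b = np.zeros(k)

def cluster_probs(x):
    # Soft cluster assignments for a batch of features.
    return softmax(x @ W + b)

p, p_pos = cluster_probs(feats), cluster_probs(feats_pos)

# Positive-pair consistency loss: pull the cluster distributions of the
# two views of the same image together (an assumed, not the paper's, loss).
loss = -np.mean(np.sum(p * np.log(p_pos + 1e-8), axis=1))

assignments = p.argmax(axis=1)  # hard cluster labels for evaluation
```

In a full implementation, `loss` would be minimized with a standard optimizer over `W` and `b` only, which is what keeps the training cost low relative to pipelines that fine-tune the backbone or a text encoder.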