Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering

📅 2023-10-27
🏛️ arXiv.org
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
Unsupervised image clustering faces two key challenges: (1) existing features inadequately capture intrinsic image structure, and (2) the absence of fine-grained semantic supervision hinders discrimination of subtle visual differences. To address these challenges, we propose StructCLIP, the first framework to integrate grid-level jigsaw pretraining with CLIP’s vision–language alignment capability. StructCLIP constructs a structure-aware proxy task by shuffling and reconstructing image patches, and enforces structural consistency constraints to enable label-free, semantics-aware representation learning. Our method jointly optimizes zero-shot transfer, contrastive learning, and k-means clustering without fine-tuning the visual backbone. On standard benchmarks, including CIFAR-10, CIFAR-100, and ImageNet-10, StructCLIP achieves up to a 9.2% absolute improvement in clustering accuracy over prior state-of-the-art methods, demonstrating strong generalization across diverse datasets and architectures.
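The patch shuffling and reconstruction behind the proxy task can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a single-channel image stored as a list of rows, a square grid whose size divides the image evenly, and hypothetical helper names (`split_into_grid`, `jigsaw_shuffle`, `jigsaw_restore`). It shows only the data manipulation, not the learned reconstruction.

```python
import random


def split_into_grid(image, grid):
    """Split a 2-D image (list of rows) into grid x grid patches, row-major."""
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by the grid size")
    ph, pw = height // grid, width // grid
    patches = []
    for gr in range(grid):
        for gc in range(grid):
            rows = image[gr * ph:(gr + 1) * ph]
            patches.append([row[gc * pw:(gc + 1) * pw] for row in rows])
    return patches


def assemble_from_grid(patches, grid):
    """Inverse of split_into_grid: stitch row-major patches into one image."""
    ph = len(patches[0])
    image = []
    for gr in range(grid):
        for r in range(ph):
            row = []
            for gc in range(grid):
                row.extend(patches[gr * grid + gc][r])
            image.append(row)
    return image


def jigsaw_shuffle(image, grid, rng):
    """Permute the patches; slot j of the result holds original patch perm[j]."""
    patches = split_into_grid(image, grid)
    perm = list(range(len(patches)))
    rng.shuffle(perm)
    shuffled = [patches[i] for i in perm]
    return assemble_from_grid(shuffled, grid), perm


def jigsaw_restore(shuffled_image, grid, perm):
    """Undo jigsaw_shuffle given the permutation it returned."""
    patches = split_into_grid(shuffled_image, grid)
    restored = [None] * len(patches)
    for slot, original_index in enumerate(perm):
        restored[original_index] = patches[slot]
    return assemble_from_grid(restored, grid)
```

In a training setup, the shuffled image would be the model input and the permutation (or the original layout) the reconstruction target. Here the permutation is simply returned so the round trip can be checked.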
Problem

Research questions and friction points this paper is trying to address.

Enhance image clustering accuracy
Improve semantic label granularity
Accelerate clustering convergence
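Since the first goal is stated in terms of clustering accuracy, a short sketch of how that metric is commonly computed may help. The summary does not spell out its definition, so this assumes the usual convention in the clustering literature: find the best one-to-one mapping from cluster ids to ground-truth classes and report the fraction of samples that match. The function name is hypothetical, and the brute-force search over mappings is practical only for a handful of clusters. Larger problems typically use the Hungarian algorithm instead.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Best-mapping accuracy between predicted clusters and true classes."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have the same length")
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes: no one-to-one mapping")

    best_hits = 0
    # Try every injective assignment of clusters to classes.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)
```

Because cluster ids are arbitrary, a prediction such as `[2, 2, 0, 0]` against labels `[0, 0, 1, 1]` is a perfect clustering under this metric even though no id matches literally.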
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a pretraining-based Grid Jigsaw Representation
Leverages CLIP for cross-modal feature extraction
Enhances image clustering through semantic differentiation
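Clustering on top of a frozen feature extractor can be sketched as plain k-means over precomputed embeddings. This is an illustrative sketch only, not the paper's joint optimization: the input vectors stand in for image features from a frozen encoder such as CLIP (no CLIP API is called here), the L2 normalization is an assumption, and the function names are hypothetical.

```python
import math
import random


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def kmeans(features, k, rng, iterations=50):
    """Lloyd's algorithm on L2-normalized features; returns cluster ids."""
    points = [l2_normalize(f) for f in features]
    # Initialize centroids from k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(iterations):
        new_assignments = [nearest_centroid(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # assignments stopped changing: converged
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignments
```

The encoder is never updated in this setup. Only the centroids move, which is what makes clustering without backbone fine-tuning cheap relative to end-to-end training.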