Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering

📅 2023-10-27

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Unsupervised image clustering faces two key challenges: (1) existing features inadequately capture intrinsic image structure, and (2) the absence of fine-grained semantic supervision hinders discrimination of subtle visual differences. To address these, we propose StructCLIP, the first framework integrating grid-level jigsaw pretraining with CLIP's vision–language alignment capability. StructCLIP constructs a structure-aware proxy task by shuffling image patches and reconstructing the original layout, and it enforces structural consistency constraints to enable label-free, semantics-aware representation learning. Our method jointly optimizes zero-shot transfer, contrastive learning, and k-means clustering without fine-tuning the visual backbone. On standard benchmarks, including CIFAR-10, CIFAR-100, and ImageNet-10, StructCLIP achieves up to a 9.2% absolute improvement in clustering accuracy over prior state-of-the-art methods, demonstrating strong generalization across diverse datasets and architectures.
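The patch shuffling and reconstruction mentioned in the summary can be illustrated with a small, dependency-free sketch. It is not the paper's implementation: the function names (`split_into_grid`, `jigsaw_shuffle`, `jigsaw_restore`), the nested-list image format, and the seeded permutation are all assumptions made for illustration. In a learned setting a model would have to predict the original layout, whereas this sketch simply inverts a known permutation.

```python
import random

def split_into_grid(image, grid):
    """Split a 2-D image (list of rows) into grid x grid patches, row-major."""
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image sides must be divisible by the grid size")
    ph, pw = height // grid, width // grid
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            rows = image[gy * ph:(gy + 1) * ph]
            patches.append([row[gx * pw:(gx + 1) * pw] for row in rows])
    return patches

def merge_from_grid(patches, grid):
    """Inverse of split_into_grid: stitch row-major patches into one image."""
    patch_height = len(patches[0])
    image = []
    for gy in range(grid):
        row_patches = patches[gy * grid:(gy + 1) * grid]
        for y in range(patch_height):
            row = []
            for patch in row_patches:
                row.extend(patch[y])
            image.append(row)
    return image

def jigsaw_shuffle(image, grid, seed=0):
    """Shuffle patches; permutation[slot] is the source patch placed at slot."""
    patches = split_into_grid(image, grid)
    permutation = list(range(len(patches)))
    random.Random(seed).shuffle(permutation)
    shuffled = [patches[src] for src in permutation]
    return merge_from_grid(shuffled, grid), permutation

def jigsaw_restore(shuffled_image, grid, permutation):
    """Undo jigsaw_shuffle given the permutation that produced it."""
    patches = split_into_grid(shuffled_image, grid)
    restored = [None] * len(patches)
    for slot, src in enumerate(permutation):
        restored[src] = patches[slot]
    return merge_from_grid(restored, grid)

if __name__ == "__main__":
    # Toy 4x4 "image" whose pixel values are just their row-major index.
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    shuffled, perm = jigsaw_shuffle(image, grid=2, seed=1)
    print(jigsaw_restore(shuffled, 2, perm) == image)
```

Shuffling only rearranges pixels, so the shuffled image has the same content in a different spatial layout; that property is what makes layout recovery usable as a label-free proxy task.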

Problem

Research questions and friction points this paper is trying to address.

Enhance image clustering accuracy

Improve semantic label granularity

Accelerate clustering convergence speed
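The page does not define how clustering accuracy is computed. A common convention, assumed here, is to score predictions under the best one-to-one mapping from cluster ids to ground-truth classes. The sketch below finds that mapping by brute force over permutations, which is only practical for a handful of clusters; evaluations at benchmark scale typically use the Hungarian algorithm instead. The function name `clustering_accuracy` is hypothetical.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """Best-case accuracy over one-to-one mappings of clusters to classes.

    Brute force over permutations: fine for a few clusters, a stand-in for
    Hungarian matching. Assumes no more clusters than ground-truth classes.
    """
    if len(true_labels) != len(cluster_ids) or not true_labels:
        raise ValueError("label lists must be non-empty and equally long")
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes is not handled here")
    best = 0
    # Try every injective assignment of cluster ids to class labels.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        correct = sum(
            1 for truth, cid in zip(true_labels, cluster_ids)
            if mapping[cid] == truth
        )
        best = max(best, correct)
    return best / len(true_labels)

if __name__ == "__main__":
    # Cluster ids are arbitrary, so a relabelled perfect clustering scores 1.0.
    print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))
```

Matching before scoring matters because cluster ids carry no inherent meaning: a clustering that groups every image correctly but numbers the groups differently should still count as fully correct.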

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a pretraining-based Grid Jigsaw Representation

Leverages CLIP for cross-modal feature extraction

Enhances image clustering through semantic differentiation
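As a rough illustration of clustering on top of frozen features, the sketch below runs plain k-means (Lloyd's algorithm) over L2-normalized vectors. It assumes the features were already extracted by some encoder such as CLIP; no CLIP API is called, and the function names, the seeded initialization, and the toy vectors are illustrative assumptions rather than the paper's procedure.

```python
import math
import random

def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(features, k, iterations=50, seed=0):
    """Lloyd's k-means on unit-normalized features; returns one label per row."""
    points = [l2_normalize(f) for f in features]
    # Initialize centers from k distinct data points (seeded for repeatability).
    centers = [list(p) for p in random.Random(seed).sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

if __name__ == "__main__":
    # Hypothetical precomputed features: two well-separated directions.
    features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(kmeans(features, k=2))
```

Normalizing first makes Euclidean k-means behave like clustering by cosine similarity, which is the usual way CLIP-style embeddings are compared.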

🔎 Similar Papers

Style Based Clustering of Visual Artworks

2024-09-12, arXiv.org, Citations: 0
