🤖 AI Summary
Unsupervised image clustering faces two key challenges: (1) existing features inadequately capture intrinsic image structure, and (2) the absence of fine-grained semantic supervision makes it hard to discriminate subtle visual differences. To address these, we propose StructCLIP, the first framework to integrate grid-level jigsaw pretraining with CLIP's vision–language alignment capability. StructCLIP builds a structure-aware proxy task by shuffling image patches on a grid and reconstructing their original arrangement, and it enforces structural consistency constraints so that semantics-aware representations can be learned without labels. The method jointly optimizes zero-shot transfer, contrastive learning, and k-means clustering, and it does not require fine-tuning the visual backbone. On standard benchmarks, including CIFAR-10, CIFAR-100, and ImageNet-10, StructCLIP improves clustering accuracy by up to 9.2% (absolute) over prior state-of-the-art methods and generalizes well across diverse datasets and architectures.
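The grid-level shuffle-and-reconstruct step behind the proxy task can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`split_into_patches`, `assemble`, `jigsaw_shuffle`, `jigsaw_restore`), the 4×4 grid, and the list-of-rows image format are all assumptions made for illustration. In a learned setting a network would have to predict the permutation; here it is simply recorded and inverted.

```python
import random


def split_into_patches(image, grid):
    """Cut an image (a list of pixel rows) into grid x grid patches, row-major."""
    h, w = len(image), len(image[0])
    if h % grid or w % grid:
        raise ValueError("image size must be divisible by the grid size")
    ph, pw = h // grid, w // grid
    return [
        [row[col * pw:(col + 1) * pw] for row in image[line * ph:(line + 1) * ph]]
        for line in range(grid)
        for col in range(grid)
    ]


def assemble(patches, grid):
    """Inverse of split_into_patches: tile row-major patches back into an image."""
    image = []
    for line in range(grid):
        band = patches[line * grid:(line + 1) * grid]
        for r in range(len(band[0])):
            # Concatenate row r of every patch in this horizontal band.
            image.append([px for patch in band for px in patch[r]])
    return image


def jigsaw_shuffle(image, grid, rng):
    """Permute the patches; return the shuffled image and the permutation used."""
    patches = split_into_patches(image, grid)
    perm = list(range(grid * grid))
    rng.shuffle(perm)
    # Slot i of the shuffled image shows original patch perm[i].
    return assemble([patches[p] for p in perm], grid), perm


def jigsaw_restore(shuffled_image, perm, grid):
    """Undo jigsaw_shuffle given the permutation (the 'reconstruction' target)."""
    shuffled = split_into_patches(shuffled_image, grid)
    restored = [None] * len(perm)
    for slot, source in enumerate(perm):
        restored[source] = shuffled[slot]
    return assemble(restored, grid)


if __name__ == "__main__":
    # Toy 8x8 "image" whose pixels are their own (row, col) coordinates.
    image = [[(r, c) for c in range(8)] for r in range(8)]
    shuffled, perm = jigsaw_shuffle(image, grid=4, rng=random.Random(0))
    print("round trip ok:", jigsaw_restore(shuffled, perm, grid=4) == image)
```

Recording the permutation alongside the shuffled image is what makes the task self-supervised: the target comes from the data transformation itself rather than from human labels.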