Language-Assisted Image Clustering Guided by Discriminative Relational Signals and Adaptive Semantic Centers

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing language-assisted image clustering methods suffer from insufficient inter-class discriminability due to overly high textual feature similarity and are constrained by predefined image-text alignments, limiting the expressive potential of the textual modality. To address these limitations, this work proposes a novel approach that generates more discriminative self-supervised signals by modeling cross-modal relationships and introduces learnable, category-level continuous semantic centers via prompt-based learning to enhance both clustering performance and interpretability. By integrating vision-language models, cross-modal relational modeling, and self-supervised clustering, the proposed method achieves an average improvement of 2.6% over state-of-the-art techniques across eight benchmark datasets, while the learned semantic centers demonstrate strong semantic interpretability.

📝 Abstract
Language-Assisted Image Clustering (LAIC) augments input images with additional texts, generated with the help of vision-language models (VLMs), to promote clustering performance. Despite recent progress, existing LAIC methods often overlook two issues: (i) the textual features constructed for each image are highly similar, leading to weak inter-class discriminability; (ii) the clustering step is restricted to pre-built image-text alignments, limiting better utilization of the text modality. To address these issues, we propose a new LAIC framework with two complementary components. First, we exploit cross-modal relations to produce more discriminative self-supervision signals for clustering, as this is compatible with the training mechanisms of most VLMs. Second, we learn category-wise continuous semantic centers via prompt learning to produce the final clustering assignments. Extensive experiments on eight benchmark datasets demonstrate that our method achieves an average improvement of 2.6% over state-of-the-art methods, and the learned semantic centers exhibit strong interpretability. Code is available in the supplementary material.
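To make the "continuous semantic centers" idea concrete, here is a minimal NumPy sketch, not the paper's implementation: it assumes a CLIP-like joint embedding space in which each cluster is represented by a learnable center vector (in the paper these would be optimized via prompt learning), and soft assignments come from temperature-scaled cosine similarity between image features and the centers. All names, dimensions, and the 0.07 temperature are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_sim(a, b):
    # Row-wise cosine similarity between two sets of vectors.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Hypothetical stand-ins: 6 image embeddings (as a VLM image encoder
# would produce) and 3 continuous semantic centers in the same space.
# In the actual method the centers are learned via prompt tuning;
# here they are random placeholders.
image_feats = rng.normal(size=(6, 8))
semantic_centers = rng.normal(size=(3, 8))

# Soft cluster assignments: softmax over similarities to each center,
# with a CLIP-style temperature (assumed value).
logits = cosine_sim(image_feats, semantic_centers) / 0.07
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
assignments = probs.argmax(axis=1)
print(assignments)  # one cluster id per image
```

Because the centers live in the shared image-text space, each learned center can be decoded or compared against text embeddings, which is what gives the clusters their semantic interpretability.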
Problem

Research questions and friction points this paper is trying to address.

Language-Assisted Image Clustering
inter-class discriminability
image-text alignment
text modality utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-Assisted Image Clustering
Discriminative Relational Signals
Adaptive Semantic Centers
Prompt Learning
Vision-Language Models
Jun Ma
Assistant Professor, The Hong Kong University of Science and Technology
Robotics, Autonomous Driving, Motion Planning and Control, Optimization

Xu Zhang
Southeast University
Machine Learning

Zhengxing Jiao
School of Computer Science and Engineering, Southeast University, Nanjing, China

Yaxin Hou
School of Computer Science and Engineering, Southeast University, Nanjing, China

Hui Liu
City University of Hong Kong
AI Security, Trustworthy AI, LLM, Fake News Detection

Junhui Hou
Department of Computer Science, City University of Hong Kong
Neural Spatial Computing

Yuheng Jia
School of Computer Science and Engineering, Southeast University, Nanjing, China