Language-Assisted Image Clustering Guided by Discriminative Relational Signals and Adaptive Semantic Centers

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing language-assisted image clustering methods suffer from insufficient inter-class discriminability due to overly high textual feature similarity and are constrained by predefined image-text alignments, limiting the expressive potential of the textual modality. To address these limitations, this work proposes a novel approach that generates more discriminative self-supervised signals by modeling cross-modal relationships and introduces learnable, category-level continuous semantic centers via prompt-based learning to enhance both clustering performance and interpretability. By integrating vision-language models, cross-modal relational modeling, and self-supervised clustering, the proposed method achieves an average improvement of 2.6% over state-of-the-art techniques across eight benchmark datasets, while the learned semantic centers demonstrate strong semantic interpretability.

📝 Abstract
Language-Assisted Image Clustering (LAIC) augments input images with additional texts, generated with the help of vision-language models (VLMs), to promote clustering performance. Despite recent progress, existing LAIC methods often overlook two issues: (i) the textual features constructed for each image are highly similar, leading to weak inter-class discriminability; (ii) the clustering step is restricted to pre-built image-text alignments, limiting better utilization of the text modality. To address these issues, we propose a new LAIC framework with two complementary components. First, we exploit cross-modal relations to produce more discriminative self-supervision signals for clustering, as this is compatible with the training mechanisms of most VLMs. Second, we learn category-wise continuous semantic centers via prompt learning to produce the final clustering assignments. Extensive experiments on eight benchmark datasets demonstrate that our method achieves an average improvement of 2.6% over state-of-the-art methods, and the learned semantic centers exhibit strong interpretability. Code is available in the supplementary material.
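To make the "continuous semantic centers" idea concrete, here is a minimal NumPy sketch, not the paper's implementation: it assumes a CLIP-like joint embedding space in which each cluster is represented by a learnable center vector (in the paper these would be optimized via prompt learning), and soft assignments come from temperature-scaled cosine similarity between image features and the centers. All names, dimensions, and the 0.07 temperature are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_sim(a, b):
    # Row-wise cosine similarity between two sets of vectors.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Hypothetical stand-ins: 6 image embeddings (as a VLM image encoder
# would produce) and 3 continuous semantic centers in the same space.
# In the actual method the centers are learned via prompt tuning;
# here they are random placeholders.
image_feats = rng.normal(size=(6, 8))
semantic_centers = rng.normal(size=(3, 8))

# Soft cluster assignments: softmax over similarities to each center,
# with a CLIP-style temperature (assumed value).
logits = cosine_sim(image_feats, semantic_centers) / 0.07
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
assignments = probs.argmax(axis=1)
print(assignments)  # one cluster id per image
```

Because the centers live in the shared image-text space, each learned center can be decoded or compared against text embeddings, which is what gives the clusters their semantic interpretability.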
Problem

Research questions and friction points this paper is trying to address.

Language-Assisted Image Clustering
inter-class discriminability
image-text alignment
text modality utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-Assisted Image Clustering
Discriminative Relational Signals
Adaptive Semantic Centers
Prompt Learning
Vision-Language Models
Jun Ma
Assistant Professor, The Hong Kong University of Science and Technology
Robotics, Autonomous Driving, Motion Planning and Control, Optimization

Xu Zhang
Southeast University
Machine Learning

Zhengxing Jiao
School of Computer Science and Engineering, Southeast University, Nanjing, China

Yaxin Hou
School of Computer Science and Engineering, Southeast University, Nanjing, China

Hui Liu
City University of Hong Kong
AI Security, Trustworthy AI, LLM, Fake News Detection

Junhui Hou
Department of Computer Science, City University of Hong Kong
Neural Spatial Computing

Yuheng Jia
School of Computer Science and Engineering, Southeast University, Nanjing, China