Interpretable Text-Guided Image Clustering via Iterative Search

📅 2025-06-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional unsupervised image clustering suffers from semantic ambiguity, yielding non-unique solutions and preventing users from controlling grouping according to specific semantic criteria (e.g., “texture rather than color”). This paper proposes a text-guided interpretable clustering framework that, for the first time, enables intent-aligned fine-grained clustering in a fully unsupervised setting. Our method introduces an iterative concept discovery mechanism that automatically distills interpretable visual concepts semantically consistent with natural language instructions. By synergistically integrating CLIP’s zero-shot visual representation capability, unsupervised clustering objective optimization, and concept distillation, it requires no annotations, fine-tuning, or auxiliary supervision. Evaluated on multiple image clustering and fine-grained classification benchmarks, our approach achieves an average 12.3% improvement in clustering accuracy and provides human-interpretable, traceable intermediate concepts.

Technology Category

Application Category

📝 Abstract
Traditional clustering methods aim to group unlabeled data points based on their similarity to each other. However, clustering, in the absence of additional information, is an ill-posed problem as there may be many different, yet equally valid, ways to partition a dataset. Distinct users may want to use different criteria to form clusters in the same data, e.g. shape v.s. color. Recently introduced text-guided image clustering methods aim to address this ambiguity by allowing users to specify the criteria of interest using natural language instructions. This instruction provides the necessary context and control needed to obtain clusters that are more aligned with the users' intent. We propose a new text-guided clustering approach named ITGC that uses an iterative discovery process, guided by an unsupervised clustering objective, to generate interpretable visual concepts that better capture the criteria expressed in a user's instructions. We report superior performance compared to existing methods across a wide variety of image clustering and fine-grained classification benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Address ambiguity in clustering with text guidance
Improve cluster alignment with user intent via instructions
Generate interpretable visual concepts iteratively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-guided clustering with natural language instructions
Iterative discovery process for visual concepts
Unsupervised clustering objective for alignment
🔎 Similar Papers
No similar papers found.