🤖 AI Summary
Existing intent clustering methods rely on costly and opaque commercial large language models (LLMs) and require a pre-specified number of clusters, limiting adaptability to real-world scenarios. To address these limitations, we propose a lightweight, training-free, annotation-free, and cluster-number-agnostic clustering method. Our approach leverages open-source small language models to generate multi-granularity pseudo-labels; clustering is then driven by label-sharing similarity, further enhanced by fusing embedding-based similarity for improved robustness. Notably, this is the first work to introduce pseudo-label generation and a multi-label classification paradigm into unsupervised intent clustering, significantly boosting interpretability and practicality. Evaluated on four benchmark datasets, our method achieves state-of-the-art or competitive performance, demonstrates exceptional cross-model and cross-dataset stability, and remains effective in low-resource settings.
📝 Abstract
In this paper, we propose an intuitive, training-free, and label-free method for intent clustering that makes minimal assumptions, using lightweight, open-source LLMs. Many current approaches rely on commercial LLMs, which are costly and offer limited transparency. Moreover, they often require the number of clusters to be specified in advance, which is rarely known in realistic settings. To address these challenges, instead of asking the LLM to match similar texts directly, we first ask it to generate pseudo-labels for each text and then perform multi-label classification over this pseudo-label set. This approach rests on the hypothesis that texts belonging to the same cluster share more labels and are therefore closer when encoded into embeddings. These pseudo-labels are also more human-readable than direct similarity matches. Our evaluation on four benchmark datasets shows that our approach achieves results comparable to or better than recent baselines while remaining simple and computationally efficient. Our findings indicate that the method can be applied in low-resource scenarios and is stable across multiple models and datasets.
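The clustering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes pseudo-labels and embeddings are already available (the paper obtains labels from a small open-source LLM), uses Jaccard overlap as the label-sharing similarity, fuses it with cosine similarity of embeddings via a weight `alpha`, and merges pairs above a similarity `threshold` so that no cluster count is needed. The function names and both hyperparameters are hypothetical choices for this sketch.

```python
from itertools import combinations

def jaccard(a, b):
    """Label-sharing similarity: fraction of pseudo-labels two texts share."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(y * y for y in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(pseudo_labels, embeddings, alpha=0.5, threshold=0.6):
    """Greedy single-link clustering over the fused similarity.
    Pairs whose fused score reaches `threshold` are merged (union-find),
    so the number of clusters emerges from the data."""
    n = len(pseudo_labels)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        sim = (alpha * jaccard(pseudo_labels[i], pseudo_labels[j])
               + (1 - alpha) * cosine(embeddings[i], embeddings[j]))
        if sim >= threshold:
            parent[find(i)] = find(j)

    roots = {}  # map each connected component to a compact cluster id
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Toy usage: two billing-related texts share a label and have similar
# embeddings, so they merge; the shipping text stays separate.
assignments = cluster(
    [["refund", "billing"], ["billing", "payment"], ["shipping"]],
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
)
# → [0, 0, 1]
```

In practice one would replace the toy inputs with LLM-generated pseudo-labels and sentence embeddings, and tune `alpha` and `threshold` on held-out data if any is available.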