NILC: Discovering New Intents with LLM-assisted Clustering

📅 2025-11-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
New Intent Discovery (NID) aims to jointly identify both known and unknown intents from unlabeled user utterances. However, existing cascaded approaches—first embedding then clustering—lack end-to-end co-optimization across stages, and embedding-only clustering often fails to capture fine-grained semantic distinctions. This paper proposes an iterative LLM-assisted clustering framework. Its core contributions are: (1) LLM-driven semantic centroid generation and hard-example rewriting to enhance semantic representation fidelity; and (2) semi-supervised seed-guided initialization coupled with soft must-link constraints to enable tight co-optimization between clustering and semantic understanding. Experiments across six cross-domain datasets demonstrate that our method significantly outperforms state-of-the-art approaches under both unsupervised and semi-supervised settings, achieving substantial gains in intent clustering accuracy.

📝 Abstract
New intent discovery (NID) seeks to recognize both new and known intents from unlabeled user utterances, a task prevalent in practical dialogue systems. Existing works on NID mainly adopt a cascaded architecture, wherein the first stage encodes the utterances into informative text embeddings, while the second groups similar embeddings into clusters (i.e., intents), typically by K-Means. However, such a cascaded pipeline fails to leverage feedback between the two steps for mutual refinement, and, meanwhile, embedding-only clustering overlooks nuanced textual semantics, leading to suboptimal performance. To bridge this gap, this paper proposes NILC, a novel clustering framework specially catered to effective NID. Particularly, NILC follows an iterative workflow, in which clustering assignments are judiciously updated by carefully refining cluster centroids and the text embeddings of uncertain utterances with the aid of large language models (LLMs). Specifically, NILC first taps into LLMs to create additional semantic centroids for clusters, thereby enriching the contextual semantics of the Euclidean centroids of embeddings. Moreover, LLMs are then harnessed to augment hard samples (ambiguous or terse utterances) identified from clusters via rewriting for subsequent cluster correction. Further, we inject supervision signals through two non-trivial techniques, seeding and soft must-links, for more accurate NID in the semi-supervised setting. Extensive experiments comparing NILC against multiple recent baselines under both unsupervised and semi-supervised settings showcase that NILC achieves significant performance improvements consistently across six benchmark datasets from diverse domains.
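The iterative workflow the abstract describes can be sketched in miniature. This is a rough offline illustration, not the paper's implementation: the toy 2-D "embeddings", the `SEMANTIC_CENTROID` table, and the `REWRITE` tables are hypothetical stand-ins for real encoder outputs and LLM calls (a real system would prompt an LLM to summarize each cluster's intent, rewrite terse utterances, and re-embed the resulting text).

```python
import math

# Toy utterances with 2-D "embeddings"; the terse "balance?" sits
# ambiguously between the two intent regions.
DATA = {
    "what is my account balance": [0.0, 0.0],
    "show my current balance":    [0.2, 0.1],
    "send money to my friend":    [5.0, 5.0],
    "wire funds to another bank": [5.1, 4.9],
    "balance?":                   [2.7, 2.4],   # hard (terse) sample
}

# Hypothetical stand-ins for LLM calls; fixed tables keep the sketch offline.
SEMANTIC_CENTROID = {0: [0.1, 0.05], 1: [5.05, 4.95]}  # embedded intent summaries
REWRITE = {"balance?": "what is the balance of my account"}
REWRITE_EMB = {"what is the balance of my account": [0.3, 0.2]}

def nearest(p, cents):
    return min(range(len(cents)), key=lambda k: math.dist(p, cents[k]))

def nilc_iteration(data, centroids):
    emb = dict(data)
    assign = {u: nearest(e, centroids) for u, e in emb.items()}
    for k in range(len(centroids)):
        members = [emb[u] for u, a in assign.items() if a == k]
        euclid = [sum(col) / len(members) for col in zip(*members)]
        # Enrich the Euclidean centroid with the LLM semantic centroid.
        centroids[k] = [(a + b) / 2 for a, b in zip(euclid, SEMANTIC_CENTROID[k])]
    # Rewrite hard samples (far from their centroid) and re-embed them.
    for u in list(emb):
        if math.dist(emb[u], centroids[assign[u]]) > 1.0 and u in REWRITE:
            emb[u] = REWRITE_EMB[REWRITE[u]]
    return {u: nearest(e, centroids) for u, e in emb.items()}

centroids = [[0.0, 0.0], [5.0, 5.0]]
labels = nilc_iteration(DATA, centroids)
```

In this run, "balance?" is initially pulled toward the wrong cluster, and the rewrite-then-reassign step places it with the other balance queries, which is the kind of cluster correction the abstract attributes to hard-sample rewriting.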
Problem

Research questions and friction points this paper is trying to address.

Recognizing new and known intents from unlabeled user utterances in dialogue systems
Overcoming limitations of cascaded pipeline architectures that lack mutual refinement
Addressing suboptimal clustering performance from embedding-only approaches ignoring textual semantics
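The cascaded baseline these points criticize, encode once and then run K-Means with no feedback loop, can be sketched as follows. The toy 2-D "embeddings" are assumed stand-ins for real encoder outputs; a farthest-point initialization keeps the example deterministic.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-Means over precomputed embeddings (pure Python)."""
    # Farthest-point initialization keeps the toy example deterministic.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c)
                                                       for c in centroids)))
    for _ in range(iters):
        # Assignment step: each utterance joins its nearest centroid.
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: centroids move to the mean of their members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Toy "embeddings" standing in for encoder outputs on user utterances.
embs = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],   # e.g. balance-check intents
        [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]   # e.g. money-transfer intents
assign = kmeans(embs, k=2)
```

Note that nothing in this loop ever revisits the embeddings themselves or consults the utterance text, which is exactly the missing mutual refinement and the overlooked semantics the Problem list describes.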
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative clustering with LLM-refined centroids and embeddings
LLM-augmented semantic centroids and hard sample rewriting
Semi-supervised seeding and soft must-link constraints for more accurate discovery
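The two semi-supervised signals, seeding and soft must-links, can be sketched as below. This is a rough illustration under assumed toy embeddings, not the paper's exact formulation: labeled utterances seed the initial centroids, and a must-link adds a soft penalty (rather than a hard rule) whenever a linked pair is split across clusters.

```python
import math

def nearest(p, cents):
    return min(range(len(cents)), key=lambda k: math.dist(p, cents[k]))

def seeded_assign(embs, labeled, must_links, k, penalty=2.0, iters=5):
    """Semi-supervised clustering sketch with seeding and soft must-links."""
    # Seeding: initialize each centroid from the labeled examples of one intent.
    cents = []
    for c in range(k):
        seeds = [embs[u] for u, y in labeled.items() if y == c]
        cents.append([sum(col) / len(seeds) for col in zip(*seeds)])
    assign = {u: nearest(e, cents) for u, e in embs.items()}
    for _ in range(iters):
        for u, e in embs.items():
            def cost(c):
                d = math.dist(e, cents[c])
                # Soft must-link: pay a penalty for each linked partner
                # currently placed in a different cluster.
                d += penalty * sum(1 for a, b in must_links
                                   if u in (a, b)
                                   and assign[b if a == u else a] != c)
                return d
            assign[u] = min(range(k), key=cost)
        for c in range(k):
            members = [embs[u] for u, a in assign.items() if a == c]
            if members:
                cents[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

embs = {
    "u1": [0.0, 0.0], "u2": [0.2, 0.0],   # intent 0 region
    "u3": [5.0, 5.0], "u4": [5.0, 5.2],   # intent 1 region
    "u5": [2.8, 2.8],                      # ambiguous, must-linked to u1
}
labeled = {"u1": 0, "u3": 1}
must_links = [("u1", "u5")]
assign = seeded_assign(embs, labeled, must_links, k=2)
```

Here the ambiguous "u5" is nearer to the intent-1 centroid by raw distance, but the must-link penalty pulls it into "u1"'s cluster, illustrating how a soft constraint steers assignments without forbidding disagreement outright.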