New Intent Discovery with Pre-training and Contrastive Learning

📅 2022-05-25

🏛️ Annual Meeting of the Association for Computational Linguistics

📈 Citations: 39

✨ Influential: 10

🤖 AI Summary

New intent discovery aims to identify previously unseen intent categories in user utterances so that a dialogue system can expand the set of intents it supports. Existing approaches typically rely on large amounts of labeled utterances and on pseudo-labeling for representation learning and clustering, which makes them label-intensive, inefficient, and inaccurate. This paper addresses two questions, how to learn semantic utterance representations and how to cluster utterances better, with two components: (1) a multi-task pre-training strategy that uses unlabeled data together with external labeled data for representation learning, and (2) a new contrastive loss that exploits self-supervisory signals in unlabeled data for clustering. On three intent recognition benchmarks, the method outperforms state-of-the-art approaches by a large margin in both unsupervised and semi-supervised settings.

📝 Abstract

New intent discovery aims to uncover novel intent categories from user utterances to expand the set of supported intent classes. It is a critical task for the development and service expansion of a practical dialogue system. Despite its importance, this problem remains under-explored in the literature. Existing approaches typically rely on a large amount of labeled utterances and employ pseudo-labeling methods for representation learning and clustering, which are label-intensive, inefficient, and inaccurate. In this paper, we provide new solutions to two important research questions for new intent discovery: (1) how to learn semantic utterance representations and (2) how to better cluster utterances. Particularly, we first propose a multi-task pre-training strategy to leverage rich unlabeled data along with external labeled data for representation learning. Then, we design a new contrastive loss to exploit self-supervisory signals in unlabeled data for clustering. Extensive experiments on three intent recognition benchmarks demonstrate the high effectiveness of our proposed method, which outperforms state-of-the-art methods by a large margin in both unsupervised and semi-supervised scenarios. The source code will be available at https://github.com/zhang-yu-wei/MTP-CLNN.
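To make the multi-task pre-training idea concrete, here is a minimal sketch of a joint objective. The abstract does not say which objectives are combined, so this sketch assumes a supervised intent-classification loss on the external labeled data plus a masked-token prediction loss on the unlabeled data, a common pairing. The names `cross_entropy`, `multitask_loss`, and `weight` are hypothetical and do not come from the paper or its repository.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def multitask_loss(labeled_batch, unlabeled_batch, weight=1.0):
    """Combine a supervised loss and a self-supervised loss into one objective.

    labeled_batch:   list of (intent_logits, intent_label) pairs from
                     external labeled data.
    unlabeled_batch: list of (token_logits, masked_token_id) pairs from
                     unlabeled utterances. Masked-token prediction is an
                     assumed stand-in for the self-supervised task.
    weight:          hypothetical scalar balancing the two terms.
    """
    supervised = sum(cross_entropy(l, y) for l, y in labeled_batch) / len(labeled_batch)
    self_supervised = sum(cross_entropy(l, t) for l, t in unlabeled_batch) / len(unlabeled_batch)
    return supervised + weight * self_supervised


# Toy batches: uniform logits over 2 intents and over a 4-token vocabulary.
labeled = [([0.0, 0.0], 0)]
unlabeled = [([0.0, 0.0, 0.0, 0.0], 1)]
loss = multitask_loss(labeled, unlabeled)
```

In a real setup both terms would be computed from one shared encoder, so that gradients from the labeled and unlabeled batches update the same utterance representation.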

Problem

Research questions and friction points this paper is trying to address.

Uncover novel intent categories from user utterances

Improve semantic utterance representation learning

Enhance clustering accuracy for intent discovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task pre-training for representation learning

Contrastive loss for self-supervised clustering

Leveraging unlabeled and labeled data jointly
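As an illustration of how a contrastive loss can draw a training signal from unlabeled utterances, the sketch below implements a generic InfoNCE-style loss over utterance embeddings. It is not the paper's exact formulation, which this page does not give. How positives are chosen (for example, augmented views or nearby utterances in embedding space) is left to the caller and is an assumption here, and `cosine`, `contrastive_loss`, and `temperature` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(embeddings, positives, temperature=0.1):
    """Average InfoNCE-style loss over all anchors that have positives.

    embeddings: list of utterance embedding vectors.
    positives:  positives[i] is a set of indices (excluding i) treated as
                positives for anchor i; every other index acts as a negative.
    """
    n = len(embeddings)
    total, counted = 0.0, 0
    for i in range(n):
        if not positives[i]:
            continue  # anchors without positives contribute nothing
        # Exponentiated, temperature-scaled similarity to every other item.
        sims = {
            j: math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
            for j in range(n)
            if j != i
        }
        numerator = sum(sims[j] for j in positives[i])
        total += -math.log(numerator / sum(sims.values()))
        counted += 1
    if counted == 0:
        raise ValueError("no anchor has any positives")
    return total / counted


# Four toy embeddings forming two tight groups: {0, 1} and {2, 3}.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matched = [{1}, {0}, {3}, {2}]      # positives come from the same group
mismatched = [{2}, {3}, {0}, {1}]   # positives come from the other group
loss_matched = contrastive_loss(emb, matched)
loss_mismatched = contrastive_loss(emb, mismatched)
```

The loss is low when each anchor's positives are already its closest items and high otherwise, so minimizing it pulls assumed-similar utterances together, which is what makes the resulting representation easier to cluster.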
