New Intent Discovery with Pre-training and Contrastive Learning

📅 2022-05-25
🏛️ Annual Meeting of the Association for Computational Linguistics
📈 Citations: 39
Influential: 10
🤖 AI Summary
New intent discovery aims to automatically identify unknown intent categories from unlabeled user utterances so that a dialogue system can continually expand its set of supported intents. Existing approaches rely heavily on large-scale manual annotation, suffer from severe noise in pseudo-labels, and cluster inefficiently. This paper proposes a multi-task pre-training framework for new intent discovery: (1) pre-training that jointly leverages external labeled data and large unlabeled corpora; (2) a contrastive loss that exploits self-supervised signals in unlabeled data to sharpen the discriminability of semantic representations; and (3) integration with clustering to produce high-quality intent assignments. Evaluated on three standard benchmarks, the method outperforms state-of-the-art approaches by a large margin in both unsupervised and semi-supervised scenarios.
📝 Abstract
New intent discovery aims to uncover novel intent categories from user utterances to expand the set of supported intent classes. It is a critical task for the development and service expansion of a practical dialogue system. Despite its importance, this problem remains under-explored in the literature. Existing approaches typically rely on a large amount of labeled utterances and employ pseudo-labeling methods for representation learning and clustering, which are label-intensive, inefficient, and inaccurate. In this paper, we provide new solutions to two important research questions for new intent discovery: (1) how to learn semantic utterance representations and (2) how to better cluster utterances. Particularly, we first propose a multi-task pre-training strategy to leverage rich unlabeled data along with external labeled data for representation learning. Then, we design a new contrastive loss to exploit self-supervisory signals in unlabeled data for clustering. Extensive experiments on three intent recognition benchmarks demonstrate the high effectiveness of our proposed method, which outperforms state-of-the-art methods by a large margin in both unsupervised and semi-supervised scenarios. The source code will be available at https://github.com/zhang-yu-wei/MTP-CLNN.
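The self-supervisory signal the abstract mentions comes from treating similar utterance embeddings as positive pairs, e.g. by mining each utterance's nearest neighbors in embedding space. A minimal sketch of that neighbor-mining idea in plain Python; the function name `top_k_neighbors` is a hypothetical helper, not from the paper's released code:

```python
import math

def top_k_neighbors(embeddings, k):
    """For each utterance embedding, return the indices of its k most
    cosine-similar other embeddings -- candidate positive pairs for
    contrastive learning on unlabeled data."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    neighbors = []
    for i, u in enumerate(embeddings):
        # Rank all other utterances by similarity to utterance i.
        sims = sorted(((cos(u, v), j) for j, v in enumerate(embeddings) if j != i),
                      reverse=True)
        neighbors.append([j for _, j in sims[:k]])
    return neighbors
```

In practice the embeddings would come from the pre-trained encoder, and neighbor search over a large corpus would use an approximate index rather than this quadratic scan.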
Problem

Research questions and friction points this paper is trying to address.

Uncover novel intent categories from user utterances
Improve semantic utterance representation learning
Enhance clustering accuracy for intent discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task pre-training for representation learning
Contrastive loss for self-supervised clustering
Leveraging unlabeled and labeled data jointly
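The contrastive-loss contribution above can be illustrated with a generic instance-level contrastive (NT-Xent-style) objective: pull an anchor utterance toward a positive example and push it away from negatives. This is only a simplified sketch with assumed function names; the paper's actual loss is more elaborate (e.g. it draws positives from mined neighbors):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """NT-Xent-style loss on one anchor: -log of the softmax probability
    assigned to the positive among positive + negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

The loss is near zero when the anchor aligns with its positive and large when it aligns with a negative, which is what drives the embedding space toward cluster-friendly structure.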
Yuwei Zhang
University of California, San Diego
Haode Zhang
Undergraduate, Shanghai Jiao Tong University
robotics, machine learning
Li-Ming Zhan
Department of Computing, The Hong Kong Polytechnic University, Hong Kong S.A.R.
Xiao-Ming Wu
Department of Computing, The Hong Kong Polytechnic University, Hong Kong S.A.R.
A. Lam
Fano Labs, Hong Kong S.A.R.