Phoneme-Level Contrastive Learning for User-Defined Keyword Spotting with Flexible Enrollment

📅 2024-12-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high false-alarm rate caused by phonetically confusable words in open-vocabulary user-defined keyword spotting, this paper proposes Phoneme-Level Contrastive Learning (PLCL). The method maintains a context-agnostic phoneme memory bank to construct confusable negative samples for data augmentation and adds a third-category discriminator dedicated to distinguishing these hard negatives. It supports multiple enrollment modes (audio-only, text-only, and joint audio-text) within a unified framework by jointly optimizing audio-text and audio-audio matching at the phoneme level. On the LibriPhrase dataset, the authors report state-of-the-art performance.

📝 Abstract
User-defined keyword spotting (KWS) enhances the user experience by allowing individuals to customize keywords. However, in open-vocabulary scenarios, most existing methods suffer from high false alarm rates with confusable words and are limited to either audio-only or text-only enrollment. Therefore, in this paper, we first explore the model's robustness against confusable words. Specifically, we propose Phoneme-Level Contrastive Learning (PLCL), which refines and aligns query and source feature representations at the phoneme level. This method enhances the model's disambiguation capability through fine-grained positive and negative comparisons for more accurate alignment, and it is generalizable to jointly optimize both audio-text and audio-audio matching, adapting to various enrollment modes. Furthermore, we maintain a context-agnostic phoneme memory bank to construct confusable negatives for data augmentation. Based on this, a third-category discriminator is specifically designed to distinguish hard negatives. Overall, we develop a robust and flexible KWS system that supports enrollment in different modalities within a unified framework. Verified on the LibriPhrase dataset, the proposed approach achieves state-of-the-art performance.
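The abstract does not give the loss formula, so the following is only a minimal sketch of what a phoneme-level contrastive objective could look like, not the authors' implementation. It assumes an InfoNCE-style loss with cosine similarity, assumes query and source embeddings are already aligned position by position, and treats every other phoneme position in the same keyword as a negative. The function names `cosine` and `phoneme_info_nce` and the `temperature` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def phoneme_info_nce(query, source, temperature=0.1):
    """Mean InfoNCE loss over aligned phoneme embeddings (illustrative only).

    query[i] and source[i] are assumed to embed the same phoneme position
    (the positive pair); source[j] with j != i serves as a negative.
    """
    if not query or len(query) != len(source):
        raise ValueError("query and source must be non-empty and equally long")
    total = 0.0
    for i, q in enumerate(query):
        logits = [cosine(q, s) / temperature for s in source]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the aligned phoneme.
        total += log_denom - logits[i]
    return total / len(query)


# Usage: correctly aligned phonemes give a much lower loss than swapped ones.
query = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = phoneme_info_nce(query, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = phoneme_info_nce(query, [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the same function would serve both audio-text and audio-audio matching, since it only sees two sequences of phoneme-level embeddings; how the paper actually shares or separates the two objectives is not stated in the abstract.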
Problem

Research questions and friction points this paper is trying to address.

Custom Keyword Recognition
Confusable Words Distinction
Error Rate Reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Phoneme Contrastive Learning
Speech-Text Matching Optimization
Confusable Counterexample Training
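As a rough illustration of confusable counterexample training, the sketch below builds a hard negative by swapping one phoneme of a keyword and assembling its embedding sequence from a phoneme memory bank. The page does not describe how the paper selects substitutions or stores embeddings, so the bank layout (a dict from phoneme label to stored embeddings), the single random substitution, and the name `make_confusable_negative` are all assumptions.

```python
import random


def make_confusable_negative(phonemes, bank, rng):
    """Return (swapped_phonemes, embeddings) for a one-phoneme substitution.

    phonemes: list of phoneme labels, all assumed to be keys of `bank`.
    bank:     hypothetical memory bank, {phoneme label: [embedding, ...]}.
    rng:      random.Random instance, passed in for reproducibility.
    """
    pos = rng.randrange(len(phonemes))
    candidates = sorted(p for p in bank if p != phonemes[pos])
    if not candidates:
        raise ValueError("bank needs at least one alternative phoneme")
    swapped = list(phonemes)
    swapped[pos] = rng.choice(candidates)
    # Draw one stored embedding per phoneme, ignoring surrounding context.
    embeddings = [rng.choice(bank[p]) for p in swapped]
    return swapped, embeddings


# Usage with a toy bank of 2-dimensional embeddings.
bank = {
    "K": [[1.0, 0.0]],
    "AE": [[0.0, 1.0]],
    "T": [[0.7, 0.7]],
    "B": [[0.9, 0.1]],
}
keyword = ["K", "AE", "T"]
negative, negative_embeddings = make_confusable_negative(
    keyword, bank, random.Random(0)
)
```

A negative built this way differs from the enrolled keyword in exactly one phoneme, which is the kind of hard case a dedicated third category could be trained to separate from both true matches and clearly unrelated words.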
Authors
Kewei Li
NERC-SLIP, University of Science and Technology of China (USTC), Hefei, China
Hengshun Zhou
iFlytek Research, Hefei, China
Kai Shen
Associate Professor of Computer Science, University of Rochester
Computer Systems
Yusheng Dai
Monash University
Multimodal, Speech Processing, Computer Vision
Jun Du
NERC-SLIP, University of Science and Technology of China (USTC), Hefei, China