Phoneme-Level Contrastive Learning for User-Defined Keyword Spotting with Flexible Enrollment

📅 2024-12-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To reduce the high false-alarm rate that phonetically confusable words cause in open-vocabulary, user-defined keyword spotting, this paper proposes Phoneme-Level Contrastive Learning (PLCL). The method maintains a context-agnostic phoneme memory bank to construct confusable negative samples for data augmentation, and adds a third-category discriminator (a three-way classifier) designed to distinguish these hard negatives. It supports audio-only, text-only, and joint audio-text enrollment within one framework by jointly optimizing audio-text and audio-audio matching at the phoneme level. On the LibriPhrase benchmark, the authors report state-of-the-art performance, with the stated aim of a system that is robust to confusable words and flexible across enrollment modes.
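The memory-bank idea in the summary above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class `PhonemeMemoryBank`, the function `make_confusable_negative`, the three label constants, and the rule of swapping one randomly chosen phoneme are hypothetical and are not taken from the paper, which does not describe its implementation on this page.

```python
import random

# Hypothetical three-way labels. The source only states that a third
# category is added for hard negatives; the numbering is an assumption.
POSITIVE, NEGATIVE, HARD_NEGATIVE = 0, 1, 2


class PhonemeMemoryBank:
    """Keeps recent embeddings per phoneme label, ignoring the context
    (neighboring phonemes, speaker, utterance) they were extracted from."""

    def __init__(self, max_per_phoneme=100):
        self.max_per_phoneme = max_per_phoneme
        self.bank = {}

    def add(self, phoneme, embedding):
        slot = self.bank.setdefault(phoneme, [])
        slot.append(list(embedding))
        if len(slot) > self.max_per_phoneme:
            slot.pop(0)  # drop the oldest entry

    def sample(self, phoneme, rng):
        return rng.choice(self.bank[phoneme])


def make_confusable_negative(phonemes, embeddings, bank, rng):
    """Build a hard negative by replacing one phoneme (and its embedding)
    with a different phoneme drawn from the bank. Inputs are not modified."""
    pos = rng.randrange(len(phonemes))
    candidates = sorted(p for p in bank.bank if p != phonemes[pos])
    if not candidates:
        raise ValueError("memory bank has no alternative phoneme to swap in")
    new_phoneme = rng.choice(candidates)

    new_phonemes = list(phonemes)
    new_phonemes[pos] = new_phoneme
    new_embeddings = [list(e) for e in embeddings]
    new_embeddings[pos] = list(bank.sample(new_phoneme, rng))
    return new_phonemes, new_embeddings, HARD_NEGATIVE
```

A uniform random swap is the simplest possible choice; a real system would more plausibly prefer acoustically similar phonemes, which this sketch does not attempt.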

📝 Abstract

User-defined keyword spotting (KWS) enhances the user experience by allowing individuals to customize keywords. However, in open-vocabulary scenarios, most existing methods suffer from high false alarm rates on confusable words and are limited to either audio-only or text-only enrollment. In this paper, we therefore first explore the model's robustness against confusable words. Specifically, we propose Phoneme-Level Contrastive Learning (PLCL), which refines and aligns query and source feature representations at the phoneme level. This method enhances the model's disambiguation capability through fine-grained positive and negative comparisons for more accurate alignment, and it generalizes to jointly optimizing audio-text and audio-audio matching, adapting to various enrollment modes. Furthermore, we maintain a context-agnostic phoneme memory bank to construct confusable negatives for data augmentation. Based on this, a third-category discriminator is specifically designed to distinguish hard negatives. Overall, we develop a robust and flexible KWS system that supports different enrollment modalities within a unified framework. Evaluated on the LibriPhrase dataset, the proposed approach achieves state-of-the-art performance.
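To make the phrase "fine-grained positive and negative comparisons" concrete, here is a minimal sketch of a phoneme-level contrastive objective. It assumes an InfoNCE-style loss over cosine similarities, with one query embedding and one source embedding per aligned phoneme; the abstract does not state the exact loss, so the formulation, the function names `cosine` and `phoneme_contrastive_loss`, and the `temperature` value are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def phoneme_contrastive_loss(query, source, temperature=0.1):
    """InfoNCE-style loss (an assumed formulation, not the paper's).

    query[i] and source[i] are embeddings of the same phoneme position and
    form the positive pair; every source[j] with j != i acts as a negative.
    """
    n = len(query)
    total = 0.0
    for i in range(n):
        logits = [cosine(query[i], source[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n
```

Because the loss only needs two sequences of phoneme-level embeddings, the same function could in principle be applied to audio-text pairs or audio-audio pairs, which is one way to read the abstract's claim that the objective covers both matching modes.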

Problem

Research questions and friction points this paper is trying to address.

Custom Keyword Recognition

Confusable Word Discrimination

Error Rate Reduction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Phoneme Contrastive Learning

Speech-Text Matching Optimization

Training with Confusable Negatives
