🤖 AI Summary
To address the limited robustness and accuracy of user-defined keyword spotting (KWS), this paper proposes DS-KWS, a two-stage framework for robust keyword detection. The first stage jointly leverages connectionist temporal classification (CTC) modeling and streaming phoneme search to efficiently localize candidate segments. The second stage introduces a Query-by-Text (QbyT) mechanism integrated with a phoneme-matching module for joint phoneme-level and utterance-level verification. The authors further design a dual data scaling strategy: expanding the ASR training data from 460 to 1,460 hours to strengthen acoustic modeling, and training the phoneme matcher on over 155,000 anchor classes to sharpen discrimination of confusable keywords. On the LibriPhrase Hard subset, DS-KWS achieves an EER of 6.13% and an AUC of 97.85%, substantially outperforming prior state-of-the-art methods. Under zero-shot evaluation on Hey-Snips, it attains 99.13% recall at one false alarm per hour, approaching fully supervised performance.
📝 Abstract
In this paper, we propose DS-KWS, a two-stage framework for robust user-defined keyword spotting. It combines a CTC-based method with a streaming phoneme search module to locate candidate segments, followed by a Query-by-Text (QbyT)-based method with a phoneme matcher module for verification at both the phoneme and utterance levels. To further improve performance, we introduce a dual data scaling strategy: (1) expanding the ASR corpus from 460 to 1,460 hours to strengthen the acoustic model; and (2) leveraging over 155k anchor classes to train the phoneme matcher, significantly enhancing the distinction of confusable words. Experiments on LibriPhrase show that DS-KWS significantly outperforms existing methods, achieving 6.13% EER and 97.85% AUC on the Hard subset. On Hey-Snips, it achieves zero-shot performance comparable to full-shot trained models, reaching 99.13% recall at one false alarm per hour.
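The LibriPhrase results above are reported as Equal Error Rate (EER) and AUC. As a quick illustration of the metric (a generic sketch, not code from the paper), EER is the operating point where the false acceptance rate (FAR) equals the false rejection rate (FRR):

```python
def compute_eer(positive_scores, negative_scores):
    """Return the EER given detector scores for keyword (positive)
    and non-keyword (negative) trials; higher score = more keyword-like."""
    thresholds = sorted(set(positive_scores) | set(negative_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # FRR: fraction of positives rejected at threshold t.
        frr = sum(s < t for s in positive_scores) / len(positive_scores)
        # FAR: fraction of negatives accepted at threshold t.
        far = sum(s >= t for s in negative_scores) / len(negative_scores)
        # EER is taken at the threshold where FAR and FRR are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Well-separated scores give EER = 0; fully overlapping scores approach 0.5.
print(compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # → 0.0
```

A lower EER (e.g. the reported 6.13%) means positive and negative score distributions overlap less, i.e. confusable keywords are better separated.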