🤖 AI Summary
To address the limited robustness and accuracy of user-defined keyword spotting (KWS), this paper proposes DS-KWS, a two-stage framework for robust keyword detection. The first stage jointly leverages connectionist temporal classification (CTC) modeling and streaming phoneme search to efficiently localize candidate segments. The second stage introduces a Query-by-Text (QbyT) mechanism integrated with a phoneme-matching module for joint phoneme-level and utterance-level verification. The authors further design a dual data scaling strategy: expanding the ASR training data from 460 to 1,460 hours to strengthen acoustic modeling, and training the phoneme matcher on over 155,000 anchor classes to sharpen discrimination of confusable keywords. On the LibriPhrase Hard subset, DS-KWS achieves an EER of 6.13% and an AUC of 97.85%, substantially outperforming prior state-of-the-art methods. Under zero-shot evaluation on Hey-Snips, it attains 99.13% recall at one false alarm per hour, approaching fully supervised performance.
📝 Abstract
In this paper, we propose DS-KWS, a two-stage framework for robust user-defined keyword spotting. It combines a CTC-based method with a streaming phoneme search module to locate candidate segments, followed by a Query-by-Text (QbyT)-based method with a phoneme matcher module for verification at both the phoneme and utterance levels. To further improve performance, we introduce a dual data scaling strategy: (1) expanding the ASR corpus from 460 to 1,460 hours to strengthen the acoustic model; and (2) leveraging over 155k anchor classes to train the phoneme matcher, significantly enhancing the distinction of confusable words. Experiments on LibriPhrase show that DS-KWS significantly outperforms existing methods, achieving 6.13% EER and 97.85% AUC on the Hard subset. On Hey-Snips, it achieves zero-shot performance comparable to full-shot trained models, reaching 99.13% recall at one false alarm per hour.
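The LibriPhrase results above are reported as Equal Error Rate (EER) and AUC. As a quick illustration of the metric (a generic sketch, not code from the paper), EER is the operating point where the false acceptance rate (FAR) equals the false rejection rate (FRR):

```python
def compute_eer(positive_scores, negative_scores):
    """Return the EER given detector scores for keyword (positive)
    and non-keyword (negative) trials; higher score = more keyword-like."""
    thresholds = sorted(set(positive_scores) | set(negative_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # FRR: fraction of positives rejected at threshold t.
        frr = sum(s < t for s in positive_scores) / len(positive_scores)
        # FAR: fraction of negatives accepted at threshold t.
        far = sum(s >= t for s in negative_scores) / len(negative_scores)
        # EER is taken at the threshold where FAR and FRR are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Well-separated scores give EER = 0; fully overlapping scores approach 0.5.
print(compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # → 0.0
```

A lower EER (e.g. the reported 6.13%) means positive and negative score distributions overlap less, i.e. confusable keywords are better separated.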