Dual Data Scaling for Robust Two-Stage User-Defined Keyword Spotting

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness and accuracy of user-defined keyword spotting (KWS), this paper proposes DS-KWS, a robust two-stage detection framework. The first stage jointly leverages connectionist temporal classification (CTC) modeling and a streaming phoneme search to efficiently localize candidate segments. The second stage applies a Query-by-Text (QbyT) mechanism with a phoneme-matching module for joint phoneme-level and utterance-level verification. A dual data scaling strategy strengthens both stages: the ASR training data is expanded from 460 to 1,460 hours to improve acoustic modeling, and the phoneme matcher is trained on over 155,000 anchor classes to sharpen discrimination of confusable keywords. On LibriPhrase Hard, DS-KWS achieves an EER of 6.13% and an AUC of 97.85%, substantially outperforming prior state-of-the-art methods. Under zero-shot evaluation on Hey-Snips, it reaches 99.13% recall at one false alarm per hour, approaching fully supervised performance.

📝 Abstract
In this paper, we propose DS-KWS, a two-stage framework for robust user-defined keyword spotting. It combines a CTC-based method with a streaming phoneme search module to locate candidate segments, followed by a QbyT-based method with a phoneme matcher module for verification at both the phoneme and utterance levels. To further improve performance, we introduce a dual data scaling strategy: (1) expanding the ASR corpus from 460 to 1,460 hours to strengthen the acoustic model; and (2) leveraging over 155k anchor classes to train the phoneme matcher, significantly enhancing the distinction of confusable words. Experiments on LibriPhrase show that DS-KWS significantly outperforms existing methods, achieving 6.13% EER and 97.85% AUC on the Hard subset. On Hey-Snips, it achieves zero-shot performance comparable to full-shot trained models, reaching 99.13% recall at one false alarm per hour.
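The first stage described above can be illustrated with a minimal sketch: greedy CTC decoding over per-frame phoneme log-posteriors, followed by an exact scan for the keyword's phoneme sequence to localize candidate segments. This is a generic illustration of the idea, not the authors' implementation; the function names and the exact-match search are assumptions (DS-KWS uses a streaming phoneme search, whose details are not given here).

```python
import numpy as np

def ctc_greedy_phonemes(log_probs, blank=0):
    """Greedy CTC decode: argmax per frame, collapse repeats, drop blanks.

    Returns (phoneme_ids, frame_indices), where frame_indices[i] is the
    first frame that emitted phoneme_ids[i].
    """
    best = log_probs.argmax(axis=1)
    phones, frames = [], []
    prev = blank
    for t, p in enumerate(best):
        if p != blank and p != prev:
            phones.append(int(p))
            frames.append(t)
        prev = p
    return phones, frames

def locate_candidates(log_probs, keyword_phones, blank=0):
    """Scan the decoded phoneme stream for occurrences of the keyword's
    phoneme sequence; return candidate (start_frame, end_frame) spans."""
    phones, frames = ctc_greedy_phonemes(log_probs, blank)
    k = len(keyword_phones)
    spans = []
    for i in range(len(phones) - k + 1):
        if phones[i:i + k] == list(keyword_phones):
            spans.append((frames[i], frames[i + k - 1]))
    return spans
```

In the full system, each candidate span would then be handed to the second-stage verifier rather than accepted directly, which is what keeps false alarms low at a fixed recall.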
Problem

Research questions and friction points this paper is trying to address.

Developing robust user-defined keyword spotting with a two-stage framework
Strengthening the acoustic model and phoneme matcher through data scaling
Improving performance on confusable words and in zero-shot scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework combining CTC and QbyT methods
Dual data scaling strategy enhances acoustic model training
Leverages 155k anchor classes to improve phoneme distinction
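The second-stage idea, joint phoneme-level and utterance-level verification, can be sketched as a two-threshold check over embedding similarities. The embeddings here stand in for learned matcher outputs, and all names and thresholds are hypothetical; the paper's QbyT matcher is a trained module, not a fixed cosine rule.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def verify(segment_phone_emb, keyword_phone_emb,
           segment_utt_emb, keyword_utt_emb,
           phone_thresh=0.5, utt_thresh=0.5):
    """Toy two-level verification: accept a candidate segment only if both
    the phoneme-level alignment score (mean cosine over aligned phoneme
    embeddings) and the utterance-level similarity clear their thresholds.
    """
    phone_score = float(np.mean(
        [cosine(s, k) for s, k in zip(segment_phone_emb, keyword_phone_emb)]
    ))
    utt_score = cosine(segment_utt_emb, keyword_utt_emb)
    accepted = phone_score >= phone_thresh and utt_score >= utt_thresh
    return accepted, phone_score, utt_score
```

Requiring both levels to agree is what the summary's "joint verification" amounts to: a confusable word may pass one check but rarely both, which is why training the matcher on many anchor classes pays off.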
Zhiqi Ai
Shanghai University, Shanghai, China
Han Cheng
Shanghai University, Shanghai, China
Yuxin Wang
Shanghai University, Shanghai, China
Shiyi Mu
Shanghai University, Shanghai, China
Shugong Xu
Professor at Xi'an Jiaotong-Liverpool University, IEEE Fellow
Machine Learning · Pattern Recognition · Wireless Systems
Yongjin Zhou
Shanghai University, Shanghai, China