🤖 AI Summary
Few-shot keyword spotting (KWS) systems generalize poorly to open-set, noisy conditions when their detection thresholds are tuned on a validation set. To address this, we propose a score calibration method for Dynamic Time Warping (DTW)-based template matching. Our approach quantizes the embedding vectors and then normalizes the DTW matching scores using the quantization error before thresholding, thereby decoupling threshold selection from model-specific performance and substantially reducing dependence on validation-set tuning. The key idea is to use the quantization error as an explicit normalization term in score calibration, which improves robustness across diverse noise conditions. Experiments on the KWS-DailyTalk dataset with noisy simulated high-frequency radio channels show that our method improves F1-score by up to 12.3% and allows thresholds to be reused across acoustic environments, which significantly improves the practical deployability of few-shot KWS systems.
📝 Abstract
Detecting occurrences of keywords with keyword spotting (KWS) systems requires thresholding continuous detection scores. Selecting appropriate thresholds is a non-trivial task that typically relies on optimizing performance on a validation dataset. However, such greedy threshold selection often leads to suboptimal performance on unseen data, particularly in varying or noisy acoustic environments and in few-shot settings. In this work, we investigate detection threshold estimation for template-based open-set few-shot KWS using dynamic time warping on noisy speech data. To mitigate the performance degradation caused by suboptimal thresholds, we propose a score calibration approach consisting of two steps: quantizing the embeddings, and normalizing the detection scores using the quantization error before thresholding. Experiments on KWS-DailyTalk with simulated high-frequency radio channels show that the proposed calibration approach simplifies the choice of detection thresholds and significantly improves the resulting performance.
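To make the two calibration steps concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify the codebook, the DTW variant, or the exact normalization, so everything here is an assumption for illustration: the function names (`quantize`, `dtw_cost`, `calibrated_score`, `detect`) are hypothetical, the codebook is taken as given, DTW uses Euclidean frame distances with length normalization, and the calibrated score is simply the DTW cost divided by the mean quantization error of the two sequences. The sketch also compares one template with one pre-segmented query rather than scanning a continuous utterance.

```python
import numpy as np


def quantize(frames, codebook):
    """Map each frame to its nearest codeword (Euclidean distance).

    Returns the quantized frames and the per-frame quantization error.
    """
    # Pairwise distances, shape (num_frames, num_codewords).
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    errors = dists[np.arange(len(frames)), nearest]
    return codebook[nearest], errors


def dtw_cost(template, query):
    """Length-normalized DTW alignment cost between two frame sequences."""
    n, m = len(template), len(query)
    local = np.linalg.norm(template[:, None, :] - query[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Standard step pattern: insertion, deletion, or match.
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m] / (n + m)


def calibrated_score(template, query, codebook, eps=1e-8):
    """DTW cost on quantized frames, scaled by the mean quantization error.

    ASSUMPTION: this ratio is an illustrative normalization only; the
    source does not state the exact formula.
    """
    q_template, err_t = quantize(template, codebook)
    q_query, err_q = quantize(query, codebook)
    raw = dtw_cost(q_template, q_query)
    scale = np.concatenate([err_t, err_q]).mean() + eps
    return raw / scale


def detect(template, query, codebook, threshold):
    """Flag a keyword hit when the calibrated cost is below the threshold."""
    return calibrated_score(template, query, codebook) < threshold
```

In this sketch a lower score means a better match, so a detection is reported when the calibrated cost falls below the threshold. Dividing by the quantization error is one simple way to put scores from different acoustic conditions on a comparable scale before a single threshold is applied; other normalizations would fit the description in the abstract equally well.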