Effective User-defined Keyword Spotting with Dual-stage Matching, Multi-modal Enrollment, and Continual Adaptation

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three challenges in user-defined keyword spotting: low discriminability among confusable words, inconsistent performance across speakers, and high data cost. The authors propose DMA-KWS, a framework built on a dual-stage matching pipeline: CTC decoding with streaming phoneme search locates candidate segments, and a Query-by-Text (QbyT) phoneme matcher then verifies them, which helps separate confusable words. Multi-modal enrollment fuses user-specific speech with text embeddings to improve accuracy for registered users, and a lightweight continual adaptation mechanism updates only 187k parameters. On the LibriPhrase Hard subset, the model achieves 97.85% AUC and 6.13% EER, which the authors report as state-of-the-art, while remaining suitable for on-device deployment.
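For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how AUC and EER can be computed from detection scores. It is not the paper's evaluation code; the function names and the toy scores are assumptions made for illustration. AUC is the area under the curve of true-positive rate against false-alarm rate, and EER is the operating point where the false-alarm rate equals the miss rate.

```python
def roc_points(scores, labels):
    """Sweep the threshold from high to low; return (false_alarm_rate, miss_rate) pairs."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 1.0)]  # threshold above every score: nothing accepted
    tp = fp = 0
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        # Consume tied scores together so they share one threshold.
        while i < len(pairs) and pairs[i][0] == s:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, (pos - tp) / pos))
    return points


def auc(points):
    """Trapezoidal area under true-positive rate vs. false-alarm rate."""
    area = 0.0
    for (f0, m0), (f1, m1) in zip(points, points[1:]):
        area += (f1 - f0) * ((1 - m0) + (1 - m1)) / 2
    return area


def eer(points):
    """Rate at which false alarms equal misses, linearly interpolated."""
    for (f0, m0), (f1, m1) in zip(points, points[1:]):
        if f1 >= m1:  # the two curves cross inside this segment
            d0, d1 = m0 - f0, f1 - m1
            t = d0 / (d0 + d1)
            return f0 + t * (f1 - f0)
    return 1.0


# Toy data (hypothetical): 1 = keyword present, 0 = confusable non-keyword.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
pts = roc_points(scores, labels)
toy_auc = auc(pts)  # 8/9 for this toy data
toy_eer = eer(pts)  # 1/3 for this toy data
```

A higher AUC and a lower EER both indicate better separation between true keywords and confusable negatives.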

📝 Abstract

User-defined keyword spotting (KWS) is crucial for personalized voice interaction, yet existing methods face several challenges: (1) insufficient discriminability among confusable words, (2) performance inconsistency across speakers with varying pronunciations, and (3) high data cost to ensure reliable wake-word performance. In this paper, we introduce DMA-KWS, an efficient and robust framework for user-defined keyword spotting. First, it adopts a dual-stage matching pipeline: CTC decoding with streaming phoneme search to locate candidate segments, followed by QbyT with a phoneme matcher for fine-grained verification, enabling it to better distinguish confusable words. Next, multi-modal enrollment fuses user-specific speech with text embeddings to further improve accuracy for registered users. Finally, a parameter-efficient continual adaptation mechanism performs lightweight updates using synthetic and real data. Extensive experiments demonstrate the superior performance of DMA-KWS. On the LibriPhrase Hard subset, it achieves 97.85% AUC and 6.13% EER, reaching state-of-the-art performance. In speaker-dependent settings, DMA-KWS consistently outperforms text-only enrollment, demonstrating significant performance gains. Moreover, the proposed parameter-efficient fine-tuning mechanism adapts DMA-KWS with only 187k updated parameters, further enhancing KWS performance while ensuring suitability for on-device deployment.
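The dual-stage idea described in the abstract can be illustrated with a toy sketch. This is not the DMA-KWS implementation: the frame labels stand in for the argmax of a CTC phoneme model, the candidate search is a simple sliding window with a mismatch tolerance, and `toy_matcher` is a hypothetical placeholder for the learned QbyT phoneme matcher. All function names, thresholds, and data here are assumptions.

```python
BLANK = "_"


def ctc_collapse(frames):
    """Greedy CTC decoding: merge repeated labels and drop blanks.

    Returns a list of (phoneme, start_frame, end_frame) tuples.
    """
    out = []
    prev = BLANK
    for t, p in enumerate(frames):
        if p != prev:
            if p != BLANK:
                out.append([p, t, t])
        elif p != BLANK:
            out[-1][2] = t  # same phoneme continues; extend its end frame
        prev = p
    return [tuple(x) for x in out]


def propose_candidates(decoded, keyword, max_mismatch=1):
    """Stage 1 (assumed form): permissive search that tolerates a few phoneme mismatches."""
    n = len(keyword)
    cands = []
    for i in range(len(decoded) - n + 1):
        window = decoded[i:i + n]
        mism = sum(p != k for (p, _, _), k in zip(window, keyword))
        if mism <= max_mismatch:
            cands.append((window[0][1], window[-1][2], [p for p, _, _ in window]))
    return cands


def toy_matcher(segment_phonemes, keyword):
    """Stage 2 placeholder: fraction of matching phonemes (a real matcher would be learned)."""
    matches = sum(p == k for p, k in zip(segment_phonemes, keyword))
    return matches / len(keyword)


def spot(frames, keyword, threshold=0.9):
    """Run both stages and return (start_frame, end_frame) of accepted segments."""
    decoded = ctc_collapse(frames)
    hits = []
    for start, end, phones in propose_candidates(decoded, keyword):
        if toy_matcher(phones, keyword) >= threshold:
            hits.append((start, end))
    return hits


# Hypothetical frame labels: the audio contains "cap" followed by "cat".
frames = "_ k k ae _ p _ _ k ae ae t _".split()
keyword = ["k", "ae", "t"]
candidates = propose_candidates(ctc_collapse(frames), keyword)
hits = spot(frames, keyword)  # [(8, 11)]
```

In this toy run the first stage proposes both "cap" and "cat" as candidates, and the stricter second stage keeps only "cat". The split mirrors the motivation given in the abstract: a cheap streaming search keeps recall high, and a finer verification step handles confusable words.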

Problem

Research questions and friction points this paper is trying to address.

keyword spotting

user-defined

confusable words

speaker variability

data efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-stage matching

multi-modal enrollment

continual adaptation