Zero-Shot KWS for Children's Speech using Layer-Wise Features from SSL Models

📅 2025-08-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Children's speech differs acoustically and linguistically from adult speech, which degrades keyword spotting (KWS) performance. This paper proposes a zero-shot cross-age KWS method that requires no labeled child speech for training: the system is trained on the WSJCAM0 adult corpus and tested on the PFSTAR children's corpus. Layer-wise representations from three self-supervised learning (SSL) models (Wav2Vec2, HuBERT, and Data2Vec) are used to train a Kaldi-based DNN KWS system and are compared against one another. Wav2Vec2 performs best, particularly at layer 22, reaching ATWV = 0.691, MTWV = 0.7003, probability of false alarm = 0.0164, and probability of miss = 0.0547 on a 30-keyword set from PFSTAR. The best-performing layer also improves significantly on an MFCC baseline under noisy conditions, the system remains effective across children's age groups, and the experiments were repeated on an additional CMU dataset.

📝 Abstract

Numerous methods have been proposed to enhance Keyword Spotting (KWS) in adult speech, but children's speech presents unique challenges for KWS systems due to its distinct acoustic and linguistic characteristics. This paper introduces a zero-shot KWS approach that leverages state-of-the-art self-supervised learning (SSL) models, including Wav2Vec2, HuBERT and Data2Vec. Features are extracted layer-wise from these SSL models and used to train a Kaldi-based DNN KWS system. The WSJCAM0 adult speech dataset was used for training, while the PFSTAR children's speech dataset was used for testing, demonstrating the zero-shot capability of our method. Our approach achieved state-of-the-art results across all keyword sets for children's speech. Notably, the Wav2Vec2 model, particularly layer 22, performed best, delivering an ATWV score of 0.691, an MTWV score of 0.7003, a probability of false alarm of 0.0164 and a probability of miss of 0.0547 for a set of 30 keywords. Furthermore, age-specific performance evaluation confirmed the system's effectiveness across different age groups of children. To assess the system's robustness against noise, additional experiments were conducted using the best-performing layer of the best-performing Wav2Vec2 model. The results demonstrated a significant improvement over the traditional MFCC-based baseline, emphasizing the potential of SSL embeddings even in noisy conditions. To further generalize the KWS framework, the experiments were repeated on an additional CMU dataset. Overall, the results highlight the significant contribution of SSL features in enhancing zero-shot KWS performance for children's speech, effectively addressing the challenges associated with the distinct characteristics of child speakers.
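
The ATWV and MTWV figures quoted in the abstract are term-weighted values. As a rough sketch of how such scores are commonly computed, the snippet below follows the NIST spoken-term-detection definition, TWV = 1 - (P_miss + beta * P_FA), averaged over keywords. The abstract does not state beta or how trials are counted, so beta = 999.9, the one-trial-per-second convention, the helper names, and every count below are assumptions chosen for illustration; none of them are values from the paper.

```python
from dataclasses import dataclass
from statistics import mean

# Assumption: beta as used in NIST STD-style evaluations; the source does not state it.
BETA = 999.9


@dataclass
class KeywordCounts:
    """Detection counts for one keyword at one decision threshold (hypothetical type)."""
    n_true: int      # reference occurrences of the keyword
    n_correct: int   # detections that match a reference occurrence
    n_spurious: int  # detections with no matching reference (false alarms)


def term_weighted_value(counts, speech_seconds, beta=BETA):
    """Return (TWV, mean P_miss, mean P_FA) over keywords that occur in the reference."""
    scored = [c for c in counts if c.n_true > 0]
    p_miss = mean(1 - c.n_correct / c.n_true for c in scored)
    # Assumption: one non-target trial per second of speech, minus the true occurrences.
    p_fa = mean(c.n_spurious / (speech_seconds - c.n_true) for c in scored)
    return 1 - (p_miss + beta * p_fa), p_miss, p_fa


def maximum_twv(counts_by_threshold, speech_seconds, beta=BETA):
    """MTWV: the best TWV reachable over all candidate decision thresholds."""
    return max(
        term_weighted_value(counts, speech_seconds, beta)[0]
        for counts in counts_by_threshold.values()
    )


# Made-up counts for two keywords at three thresholds over 10,000 s of test speech.
speech_seconds = 10_000
counts_by_threshold = {
    0.3: [KeywordCounts(20, 19, 4), KeywordCounts(10, 10, 2)],
    0.5: [KeywordCounts(20, 18, 1), KeywordCounts(10, 9, 0)],
    0.7: [KeywordCounts(20, 15, 0), KeywordCounts(10, 7, 0)],
}

# ATWV is the TWV at the threshold the system actually operates at (here 0.7).
atwv, p_miss, p_fa = term_weighted_value(counts_by_threshold[0.7], speech_seconds)
mtwv = maximum_twv(counts_by_threshold, speech_seconds)
print(f"ATWV={atwv:.4f} MTWV={mtwv:.4f} P_miss={p_miss:.4f} P_FA={p_fa:.6f}")
```

Because beta is large, a handful of false alarms can cost more than several misses, which is why the loosest threshold in this toy example scores worst. The gap between ATWV and MTWV indicates how far the chosen operating threshold is from the best achievable one.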

Problem

Research questions and friction points this paper is trying to address.

Enhancing keyword spotting for children's speech using SSL models

Addressing acoustic challenges in zero-shot KWS for child speakers

Improving robustness of keyword detection across age groups

Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise SSL feature extraction for KWS

Zero-shot transfer from adult to children's speech

Wav2Vec2 (layer 22) outperforms the traditional MFCC baseline

🔎 Similar Papers

Personalized Speech Recognition for Children with Test-Time Adaptation