MT-HuBERT: Self-Supervised Mix-Training for Few-Shot Keyword Spotting in Mixed Speech

📅 2025-11-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the challenge of detecting overlapping multi-keyword utterances in mixed speech under few-shot conditions, this paper proposes MT-HuBERT, a self-supervised pretraining framework. Departing from reliance on labeled data, MT-HuBERT jointly models contextual cues and mixture components via a novel Mix-Training objective, learning separable acoustic unit representations from unlabeled mixed-speech data to enable precise localization of overlapping keywords. Its core innovation lies in the first integration of mixture decomposition modeling into the HuBERT architecture, overcoming the dual bottlenecks of speech overlap handling and label scarcity that limit conventional approaches. Evaluated on the Google Speech Commands v2 benchmark, MT-HuBERT achieves significant improvements over existing state-of-the-art methods under both clean and mixed-speech few-shot settings, demonstrating markedly enhanced generalization capability and robustness to acoustic interference.

Technology Category

Application Category

📝 Abstract
Few-shot keyword spotting aims to detect previously unseen keywords with very limited labeled samples. A pre-training and adaptation paradigm is typically adopted for this task. While effective in clean conditions, most existing approaches struggle with mixed keyword spotting--detecting multiple overlapping keywords within a single utterance--a capability essential for real-world applications. We have previously proposed a pre-training approach based on Mix-Training (MT) to tackle the mixed keyword detection problem and demonstrated its efficiency. However, this approach is fully supervised, unable to utilize vast unlabeled data. To this end, we propose Mix-Training HuBERT (MT-HuBERT), a self-supervised learning (SSL) pre-training framework that implements the MT criterion during pre-training. MT-HuBERT predicts, in a self-supervised manner, the clean acoustic units of each constituent signal from contextual cues, in contrast to predicting compositional patterns of mixed speech. Experiments conducted on the Google Speech Commands (GSC v2) corpus demonstrate that our proposed MT-HuBERT consistently outperforms several state-of-the-art baselines in few-shot KWS tasks under both mixed and clean conditions.
Problem

Research questions and friction points this paper is trying to address.

Detecting unseen keywords with limited labeled data in mixed speech
Improving keyword spotting robustness against overlapping speech signals
Enabling self-supervised learning for few-shot keyword detection systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised mix-training framework for keyword spotting
Predicts clean acoustic units from mixed speech context
Outperforms baselines in few-shot mixed keyword detection
🔎 Similar Papers
No similar papers found.
J
Junming Yuan
School of Computer Science and Technology, Xinjiang University, China
Ying Shi
Ying Shi
Syracuse University
Education PolicyRacial InequalityLabor Economics
D
Dong Wang
Center for Speech and Language Technologies, BNRist, Tsinghua University, China
Lantian Li
Lantian Li
Associate Professor @ Beijing University of Posts and Telecommunications
Speech Information ProcessingDeep Learning
A
Askar Hamdulla
School of Computer Science and Technology, Xinjiang University, China