Adaptive Knowledge Distillation for Device-Directed Speech Detection

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Device-directed speech detection (DDSD) aims to distinguish user-initiated queries to voice assistants from background speech, a critical capability for natural human-computer interaction. This paper proposes an adaptive knowledge distillation framework wherein a frozen large-scale pretrained speech encoder serves as the teacher, and a lightweight student model learns task-specific representations via trainable adapters—supporting both keyword-triggered and keyword-free wake-up scenarios, and compatible with both Transformer and Conformer architectures. The method integrates acoustic representation transfer, adapter-based fine-tuning, and joint optimization, enabling efficient knowledge transfer without increasing inference latency. Experiments demonstrate that, compared to baseline student models, our approach reduces equal error rates by 26% (keyword-based) and 19% (keyword-free), respectively, while significantly improving generalization and robustness across diverse acoustic conditions.

📝 Abstract
Device-directed speech detection (DDSD) is a binary classification task that separates the user's queries to a voice assistant (VA) from background speech or side conversations. This is important for achieving a naturalistic user experience. To this end, we propose knowledge distillation (KD) to enhance DDSD accuracy while ensuring efficient deployment. Specifically, we introduce a novel adaptive KD method that transfers knowledge from the general representations of a large pre-trained ASR acoustic encoder (teacher). We apply task-specific adapters, on top of the (frozen) teacher encoder, trained jointly with the student model on DDSD. We demonstrate that the proposed adaptive KD outperforms the student model without distillation on both keyword-based and keyword-free (follow-up) invocations, with improvements of +26% and +19% in terms of Equal Error Rate, respectively. We also show that this approach generalizes across transformer- and conformer-based model architectures.
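The joint objective described in the abstract — a DDSD classification loss on the student plus a feature-distillation term matching the student to the adapter-transformed frozen teacher — can be sketched as below. This is a minimal numpy illustration, not the paper's implementation: the shapes, the single linear adapter projection, and the weight `lam` are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(logits, labels):
    # binary cross-entropy for the DDSD classification head (device-directed vs. not)
    p = sigmoid(logits)
    return float(np.mean(-labels * np.log(p) - (1 - labels) * np.log(1 - p)))

def distill_mse(student_feats, adapted_teacher_feats):
    # feature-level distillation: pull student representations toward the
    # adapter-transformed representations of the frozen teacher encoder
    return float(np.mean((student_feats - adapted_teacher_feats) ** 2))

# hypothetical batch: 4 utterances, 16-dim frame-pooled features
labels = rng.integers(0, 2, size=4).astype(float)
student_logits = rng.normal(size=4)
student_feats = rng.normal(size=(4, 16))
teacher_feats = rng.normal(size=(4, 16))   # outputs of the frozen teacher

# a single random projection standing in for the trainable task-specific
# adapter placed on top of the frozen teacher; only this (and the student)
# would receive gradients during joint training
W_adapter = rng.normal(size=(16, 16)) * 0.1
adapted = teacher_feats @ W_adapter

lam = 0.5  # hypothetical weight balancing task loss vs. distillation loss
loss = bce(student_logits, labels) + lam * distill_mse(student_feats, adapted)
```

Because the teacher is frozen, inference-time latency is set entirely by the student; the adapter and distillation term exist only in the training graph.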
Problem

Research questions and friction points this paper is trying to address.

Improving device-directed speech detection accuracy
Enhancing efficiency via adaptive knowledge distillation
Generalizing across transformer and conformer architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive knowledge distillation for DDSD
Task-specific adapters with frozen teacher
Improves accuracy across model architectures
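The "task-specific adapters with frozen teacher" idea is typically realized as small bottleneck modules inserted after the frozen encoder layers. A minimal sketch, with hypothetical dimensions and a standard down-project/nonlinearity/up-project residual shape (the paper's exact adapter design may differ):

```python
import numpy as np

def bottleneck_adapter(x, W_down, W_up):
    # down-project to a small rank, apply ReLU, up-project, then add a
    # residual connection so the frozen teacher's features pass through intact
    h = np.maximum(x @ W_down, 0.0)
    return x + h @ W_up

rng = np.random.default_rng(1)
d, r = 32, 8                                  # hypothetical encoder dim and bottleneck rank
x = rng.normal(size=(4, d))                   # frozen teacher outputs for a batch of 4
W_down = rng.normal(size=(d, r)) * 0.05       # only these small matrices are trained,
W_up = rng.normal(size=(r, d)) * 0.05         # keeping the adaptation parameter-cheap
y = bottleneck_adapter(x, W_down, W_up)
```

Since the adapter keeps the input and output dimensions equal, it can be dropped behind any teacher layer without changing downstream shapes, which is what lets the same recipe apply to both transformer and conformer encoders.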
👤 Authors
Hyung Gun Chi, Apple
Florian Pesce, Apple
Wonil Chang, Apple
Oggi Rudovic, Apple
Arturo Argueta, Apple
Stefan Braun, Apple
Vineet Garg, Apple — Machine Learning, Speech Recognition
Ahmed Hussen Abdelaziz, Apple — Noise Robust Automatic Speech Recognition, Audio-visual ASR, Audio-visual Speech Enhancement