🤖 AI Summary
Device-directed speech detection (DDSD) distinguishes user queries addressed to a voice assistant from background speech and side conversations, a capability critical for natural human-computer interaction. This paper proposes an adaptive knowledge distillation framework in which a frozen, large-scale pretrained speech encoder serves as the teacher: task-specific adapters are trained on top of the frozen teacher, jointly with a lightweight student model, covering both keyword-triggered and keyword-free (follow-up) invocations. The method combines acoustic representation transfer, adapter-based fine-tuning, and joint optimization, enabling efficient knowledge transfer without adding inference latency. Compared to a student model trained without distillation, the approach reduces Equal Error Rate by a relative 26% (keyword-based) and 19% (keyword-free), and generalizes across both Transformer- and Conformer-based architectures.
📝 Abstract
Device-directed speech detection (DDSD) is a binary classification task that separates a user's queries to a voice assistant (VA) from background speech or side conversations, which is important for a naturalistic user experience. To this end, we propose knowledge distillation (KD) to enhance DDSD accuracy while ensuring efficient deployment. Specifically, we introduce a novel adaptive KD method that transfers knowledge from the general representations of a large pre-trained ASR acoustic encoder (the teacher). We apply task-specific adapters on top of the frozen teacher encoder, trained jointly with the student model on the DDSD task. We demonstrate that the proposed adaptive KD outperforms the student model without distillation on both keyword-based and keyword-free (follow-up) invocations, with relative improvements of 26% and 19% in Equal Error Rate (EER), respectively. We also show that this approach generalizes across Transformer- and Conformer-based model architectures.
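The training setup described above, a frozen teacher with trainable adapters distilled into a small student under a joint loss, can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's implementation: the encoder architectures, feature dimensions, bottleneck size, and loss weighting are all placeholder assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of adaptive KD for DDSD; all sizes/modules are illustrative.

class Adapter(nn.Module):
    """Residual bottleneck adapter placed on top of the frozen teacher encoder."""
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

# Stand-in for a large pre-trained ASR encoder (frozen during training).
teacher = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
for p in teacher.parameters():
    p.requires_grad = False

adapter = Adapter(256)                                   # task-specific, trainable
student = nn.Sequential(nn.Linear(80, 64), nn.ReLU(), nn.Linear(64, 256))
head = nn.Linear(256, 1)                                 # binary DDSD classifier

feats = torch.randn(8, 80)                               # a batch of acoustic features
labels = torch.randint(0, 2, (8, 1)).float()             # device-directed vs. background

t_repr = adapter(teacher(feats))                         # adapted teacher representation
s_repr = student(feats)                                  # student representation
logits = head(s_repr)

# Joint optimization: representation-transfer loss + DDSD task loss.
kd_loss = nn.functional.mse_loss(s_repr, t_repr)
task_loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss = task_loss + 0.5 * kd_loss                         # 0.5 weight is an assumption
loss.backward()                                          # adapters and student get gradients
```

At inference time only the student and classifier head are kept, so the teacher and adapters add no latency, matching the efficiency claim above.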