🤖 AI Summary
Device-directed speech detection (DDSD) distinguishes user queries addressed to a voice assistant from background speech and side conversations, a capability critical for natural human-computer interaction. This paper proposes an adaptive knowledge distillation framework in which a frozen, large-scale pretrained speech encoder serves as the teacher: task-specific adapters are trained on top of the frozen teacher, jointly with a lightweight student model, covering both keyword-triggered and keyword-free (follow-up) invocations. The method combines acoustic representation transfer, adapter-based fine-tuning, and joint optimization, enabling efficient knowledge transfer without adding inference latency. Compared to a student model trained without distillation, the approach reduces Equal Error Rate by a relative 26% (keyword-based) and 19% (keyword-free), and generalizes across both Transformer- and Conformer-based architectures.
📝 Abstract
Device-directed speech detection (DDSD) is a binary classification task that separates a user's queries to a voice assistant (VA) from background speech or side conversations, which is important for a naturalistic user experience. To this end, we propose knowledge distillation (KD) to enhance DDSD accuracy while ensuring efficient deployment. Specifically, we introduce a novel adaptive KD method that transfers knowledge from the general representations of a large pre-trained ASR acoustic encoder (the teacher). We apply task-specific adapters on top of the frozen teacher encoder, trained jointly with the student model on the DDSD task. We demonstrate that the proposed adaptive KD outperforms the student model without distillation on both keyword-based and keyword-free (follow-up) invocations, with relative improvements of 26% and 19% in Equal Error Rate (EER), respectively. We also show that this approach generalizes across Transformer- and Conformer-based model architectures.
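The training setup described above, a frozen teacher with trainable adapters distilled into a small student under a joint loss, can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's implementation: the encoder architectures, feature dimensions, bottleneck size, and loss weighting are all placeholder assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of adaptive KD for DDSD; all sizes/modules are illustrative.

class Adapter(nn.Module):
    """Residual bottleneck adapter placed on top of the frozen teacher encoder."""
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

# Stand-in for a large pre-trained ASR encoder (frozen during training).
teacher = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
for p in teacher.parameters():
    p.requires_grad = False

adapter = Adapter(256)                                   # task-specific, trainable
student = nn.Sequential(nn.Linear(80, 64), nn.ReLU(), nn.Linear(64, 256))
head = nn.Linear(256, 1)                                 # binary DDSD classifier

feats = torch.randn(8, 80)                               # a batch of acoustic features
labels = torch.randint(0, 2, (8, 1)).float()             # device-directed vs. background

t_repr = adapter(teacher(feats))                         # adapted teacher representation
s_repr = student(feats)                                  # student representation
logits = head(s_repr)

# Joint optimization: representation-transfer loss + DDSD task loss.
kd_loss = nn.functional.mse_loss(s_repr, t_repr)
task_loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss = task_loss + 0.5 * kd_loss                         # 0.5 weight is an assumption
loss.backward()                                          # adapters and student get gradients
```

At inference time only the student and classifier head are kept, so the teacher and adapters add no latency, matching the efficiency claim above.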