HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing personalized voice activity detection (PVAD) methods require architectural modifications to standard VAD models—such as incorporating FiLM layers—leading to increased deployment complexity and poor generalization across speakers. To address this, we propose a lightweight, backbone-agnostic PVAD framework that enables speaker-conditioned parameter adaptation without altering the base VAD architecture. Specifically, a hypernetwork dynamically generates personalized weights for key layers of the VAD model conditioned on speaker embeddings. This approach fine-tunes only a small subset of parameters while preserving the original model structure, enabling rapid, single-model adaptation to multiple speakers. Evaluated on multiple benchmarks, our method achieves significant improvements in mean average precision (mAP), alongside reduced deployment overhead and computational cost. The framework thus offers enhanced efficiency, flexibility, and practicality for real-world PVAD applications.

📝 Abstract
Personalized Voice Activity Detection (PVAD) systems activate only in response to a specific target speaker by incorporating speaker embeddings from enrollment utterances. Unlike existing methods that require architectural changes, such as FiLM layers, our approach employs a hypernetwork to modify the weights of a few selected layers within a standard voice activity detection (VAD) model. This enables speaker conditioning without changing the VAD architecture, allowing the same VAD model to adapt to different speakers by updating only a small subset of the layers. We propose HyWA-PVAD, a hypernetwork weight adaptation method, and evaluate it against multiple baseline conditioning techniques. Our comparison shows consistent improvements in PVAD performance. HyWA also offers practical advantages for deployment by preserving the core VAD architecture. Our new approach improves on current conditioning techniques in two ways: i) it increases the mean average precision, and ii) it simplifies deployment by reusing the same VAD architecture.
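The mechanism described in the abstract can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the class names, layer sizes, embedding dimension, and the tiny two-layer "VAD backbone" are all assumptions made for the example. The key idea it demonstrates is that a small hypernetwork maps a speaker embedding to the weight matrix and bias of one selected layer, which is then applied functionally inside an otherwise unmodified model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyperNet(nn.Module):
    """Hypothetical hypernetwork: maps a speaker embedding to the weight
    matrix and bias of one linear layer in an otherwise unchanged VAD model."""

    def __init__(self, emb_dim: int, in_f: int, out_f: int):
        super().__init__()
        self.in_f, self.out_f = in_f, out_f
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, 128),
            nn.ReLU(),
            nn.Linear(128, out_f * in_f + out_f),  # flattened weights + bias
        )

    def forward(self, spk_emb: torch.Tensor):
        p = self.mlp(spk_emb)
        w = p[: self.out_f * self.in_f].view(self.out_f, self.in_f)
        b = p[self.out_f * self.in_f:]
        return w, b


class TinyVAD(nn.Module):
    """Toy stand-in for a standard VAD backbone; only the middle layer's
    weights are supplied externally, by the hypernetwork."""

    def __init__(self, feat_dim: int = 40, hidden: int = 64):
        super().__init__()
        self.frontend = nn.Linear(feat_dim, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats, w, b):
        h = torch.relu(self.frontend(feats))
        h = torch.relu(F.linear(h, w, b))    # speaker-conditioned layer
        return torch.sigmoid(self.head(h))   # frame-level speech probability


# One forward pass for one (made-up) enrolled speaker.
hyper = HyperNet(emb_dim=192, in_f=64, out_f=64)
vad = TinyVAD()
spk_emb = torch.randn(192)      # e.g. an x-vector from enrollment audio
feats = torch.randn(100, 40)    # 100 frames of acoustic features
w, b = hyper(spk_emb)           # personalized weights for this speaker
probs = vad(feats, w, b)        # per-frame target-speaker speech probs
```

Because the base architecture never changes, the same `TinyVAD` instance serves every speaker; switching speakers only means regenerating one layer's weights from a new embedding, which is the deployment advantage the paper claims.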
Problem

Research questions and friction points this paper is trying to address.

Adapts VAD model weights for personalized speaker detection
Enables speaker conditioning without architectural modifications
Improves detection accuracy while maintaining deployment flexibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hypernetwork modifies the weights of selected VAD layers
Enables speaker conditioning without architectural changes
Reuses same VAD architecture for different speakers
Mahsa Ghazvini Nejad
Huawei Noah’s Ark Lab, Canada
Hamed Jafarzadeh Asl
Huawei Noah’s Ark Lab, Canada
Amin Edraki
Huawei Noah’s Ark Lab, Canada
Mohammadreza Sadeghi
Huawei Noah’s Ark Lab, Canada
Masoud Asgharian
Professor, Dept of Math & Stat, McGill University
Statistics, OR/Optimization, ML/DNN/LLM
Yuanhao Yu
Huawei Noah’s Ark Lab, Canada
Vahid Partovi Nia
Huawei Noah's Ark Lab and Ecole Polytechnique de Montreal
high-dimensional data, statistical learning, deep learning, edge intelligence