HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing personalized voice activity detection (PVAD) methods require architectural modifications to standard VAD models—such as incorporating FiLM layers—leading to increased deployment complexity and poor generalization across speakers. To address this, we propose a lightweight, backbone-agnostic PVAD framework that enables speaker-conditioned parameter adaptation without altering the base VAD architecture. Specifically, a hypernetwork dynamically generates personalized weights for key layers of the VAD model conditioned on speaker embeddings. This approach fine-tunes only a small subset of parameters while preserving the original model structure, enabling rapid, single-model adaptation to multiple speakers. Evaluated on multiple benchmarks, our method achieves significant improvements in mean average precision (mAP), alongside reduced deployment overhead and computational cost. The framework thus offers enhanced efficiency, flexibility, and practicality for real-world PVAD applications.

📝 Abstract
Personalized Voice Activity Detection (PVAD) systems activate only in response to a specific target speaker by incorporating speaker embeddings from enrollment utterances. Unlike existing methods that require architectural changes, such as FiLM layers, our approach employs a hypernetwork to modify the weights of a few selected layers within a standard voice activity detection (VAD) model. This enables speaker conditioning without changing the VAD architecture, allowing the same VAD model to adapt to different speakers by updating only a small subset of the layers. We propose HyWA-PVAD, a hypernetwork weight adaptation method, and evaluate it against multiple baseline conditioning techniques. Our comparison shows consistent improvements in PVAD performance. HyWA also offers practical advantages for deployment by preserving the core VAD architecture. Our new approach improves on current conditioning techniques in two ways: i) it increases the mean average precision, and ii) it simplifies deployment by reusing the same VAD architecture.
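The mechanism described in the abstract can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the class names, layer sizes, embedding dimension, and the tiny two-layer "VAD backbone" are all assumptions made for the example. The key idea it demonstrates is that a small hypernetwork maps a speaker embedding to the weight matrix and bias of one selected layer, which is then applied functionally inside an otherwise unmodified model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyperNet(nn.Module):
    """Hypothetical hypernetwork: maps a speaker embedding to the weight
    matrix and bias of one linear layer in an otherwise unchanged VAD model."""

    def __init__(self, emb_dim: int, in_f: int, out_f: int):
        super().__init__()
        self.in_f, self.out_f = in_f, out_f
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, 128),
            nn.ReLU(),
            nn.Linear(128, out_f * in_f + out_f),  # flattened weights + bias
        )

    def forward(self, spk_emb: torch.Tensor):
        p = self.mlp(spk_emb)
        w = p[: self.out_f * self.in_f].view(self.out_f, self.in_f)
        b = p[self.out_f * self.in_f:]
        return w, b


class TinyVAD(nn.Module):
    """Toy stand-in for a standard VAD backbone; only the middle layer's
    weights are supplied externally, by the hypernetwork."""

    def __init__(self, feat_dim: int = 40, hidden: int = 64):
        super().__init__()
        self.frontend = nn.Linear(feat_dim, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats, w, b):
        h = torch.relu(self.frontend(feats))
        h = torch.relu(F.linear(h, w, b))    # speaker-conditioned layer
        return torch.sigmoid(self.head(h))   # frame-level speech probability


# One forward pass for one (made-up) enrolled speaker.
hyper = HyperNet(emb_dim=192, in_f=64, out_f=64)
vad = TinyVAD()
spk_emb = torch.randn(192)      # e.g. an x-vector from enrollment audio
feats = torch.randn(100, 40)    # 100 frames of acoustic features
w, b = hyper(spk_emb)           # personalized weights for this speaker
probs = vad(feats, w, b)        # per-frame target-speaker speech probs
```

Because the base architecture never changes, the same `TinyVAD` instance serves every speaker; switching speakers only means regenerating one layer's weights from a new embedding, which is the deployment advantage the paper claims.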
Problem

Research questions and friction points this paper is trying to address.

Adapts VAD model weights for personalized speaker detection
Enables speaker conditioning without architectural modifications
Improves detection accuracy while maintaining deployment flexibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hypernetwork modifies the weights of selected VAD layers
Enables speaker conditioning without architectural changes
Reuses same VAD architecture for different speakers
Mahsa Ghazvini Nejad
Huawei Noah’s Ark Lab, Canada
Hamed Jafarzadeh Asl
Huawei Noah’s Ark Lab, Canada
Amin Edraki
Huawei Noah’s Ark Lab, Canada
Mohammadreza Sadeghi
Huawei Noah’s Ark Lab, Canada
Masoud Asgharian
Professor, Dept of Math & Stat, McGill University
Statistics, OR/Optimization, ML/DNN/LLM
Yuanhao Yu
Huawei Noah’s Ark Lab, Canada
Vahid Partovi Nia
Huawei Noah's Ark Lab and Ecole Polytechnique de Montreal
high-dimensional data, statistical learning, deep learning, edge intelligence