Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection

📅 2025-01-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the core problem of “who spoke what and when” in multi-speaker scenarios—particularly under speech overlap—by proposing the first end-to-end unified framework for joint target speaker extraction (TSE) and personalized voice activity detection (PVAD), without requiring pre-extracted speaker embeddings. Methodologically, it introduces a frame-level cross-attention mechanism to dynamically model speaker representations and designs a scene-aware, differentiated multi-task loss function to jointly optimize TSE and PVAD. By eliminating reliance on fixed speaker embeddings, the framework achieves strong generalizability and scene adaptability. Evaluated on LibriMix and SparseLibriMix benchmarks, it significantly outperforms existing state-of-the-art methods in both TSE and PVAD performance, demonstrating superior robustness under overlapping speech conditions.

Technology Category

Application Category

📝 Abstract
Determining 'who spoke what and when' remains challenging in real-world applications. In typical scenarios, Speaker Diarization (SD) is employed to address the problem of 'who spoke when,' while Target Speaker Extraction (TSE) or Target Speaker Automatic Speech Recognition (TSASR) techniques are utilized to resolve the issue of 'who spoke what.' Although some works have achieved promising results by combining SD and TSE systems, inconsistencies remain between SD and TSE regarding both output inconsistency and scenario mismatch. To address these limitations, we propose a Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection (USEF-TP) model that jointly performs TSE and Personal Voice Activity Detection (PVAD). USEF-TP leverages frame-level features obtained through a cross-attention mechanism as speaker-related features instead of using speaker embeddings as in traditional approaches. Additionally, a multi-task learning algorithm with a scenario-aware differentiated loss function is applied to ensure robust performance across various levels of speaker overlap. The experimental results show that our proposed USEF-TP model achieves superior performance in TSE and PVAD tasks on the LibriMix and SparseLibriMix datasets.
Problem

Research questions and friction points this paper is trying to address.

Multi-speaker Separation
Speaker Diarization
Overlapping Speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

USEF-TP
Simultaneous TSE and SD
Advanced Training Method
🔎 Similar Papers
No similar papers found.
Bang Zeng
Bang Zeng
Wuhan University | Duke Kunshan University
Target Speaker ExtractionPersonal Voice Activity Detection
M
Ming Li
School of Computer Science, Wuhan University, Wuhan, China; Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems, Data Science Research Center, Duke Kunshan University, Kunshan, China