🤖 AI Summary
Existing target speaker extraction methods struggle to simultaneously achieve high speech quality and computational efficiency. To address this challenge, this work proposes Mask2Flow-TSE, a two-stage framework that first performs rapid coarse separation via time-frequency masking and then refines the masked spectrogram in a single step using flow matching to generate high-fidelity speech, thereby avoiding the need to reconstruct speech from noise. By synergistically combining discriminative and generative modeling paradigms, the proposed method attains performance on par with state-of-the-art generative approaches while significantly improving inference speed, all within a compact model size of approximately 85 million parameters.
📝 Abstract
Target speaker extraction (TSE) isolates the target speaker's voice from an overlapping speech mixture given a reference utterance. Existing approaches typically fall into two categories: discriminative and generative. Discriminative methods apply time-frequency masking for fast inference but often over-suppress the target signal, while generative methods synthesize high-quality speech at the cost of numerous iterative sampling steps. We propose Mask2Flow-TSE, a two-stage framework that combines the strengths of both paradigms. The first stage applies discriminative masking for coarse separation, and the second stage employs flow matching to refine the output toward the target speech. Unlike generative approaches that synthesize speech from Gaussian noise, our method starts from the masked spectrogram, enabling high-quality reconstruction in a single inference step. Experiments show that Mask2Flow-TSE achieves performance comparable to existing generative TSE methods with approximately 85M parameters.
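The key inference idea (refining the masked spectrogram with a single flow-matching step, rather than integrating an ODE from Gaussian noise) can be illustrated with a minimal sketch. The paper does not specify an implementation; the function names, the toy velocity field, and the straight-line (rectified) probability path below are all illustrative assumptions, with the masked spectrogram playing the role of the flow's starting point x₀:

```python
import numpy as np

def one_step_flow_refine(masked_spec, velocity_fn):
    """Single-step Euler integration of a flow-matching ODE.

    With a straight-line probability path from x0 (masked spectrogram)
    to x1 (clean target spectrogram), the learned velocity field is
    (approximately) constant along the path, so one Euler step suffices:
        x1 ≈ x0 + 1.0 * v(x0, t=0)
    """
    t = 0.0  # start of the flow: the coarse masked spectrogram
    return masked_spec + velocity_fn(masked_spec, t)

# Toy stand-in for a trained velocity network: for a straight-line path
# the ideal velocity is simply (target - current).
clean_spec = np.ones((4, 8))                    # hypothetical clean target
velocity_fn = lambda x, t: clean_spec - x       # ideal rectified-flow velocity
masked_spec = 0.5 * clean_spec                  # coarse stage-1 output
refined = one_step_flow_refine(masked_spec, velocity_fn)
```

With the ideal velocity field, the single Euler step lands exactly on the clean target; in practice a learned network only approximates this field, which is why starting near the target (the masked spectrogram) rather than at Gaussian noise makes one-step refinement viable.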