Context-Aware Two-Step Training Scheme for Domain Invariant Speech Separation

📅 2025-03-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Speech separation models trained on synthetic data suffer significant performance degradation on real-world mixed speech. To address this domain gap, we propose a context-aware two-stage decoupled training framework that disentangles context extraction from source separation, inspired by human auditory processing. Our method leverages phoneme- or word-level contextual representations—extracted from pre-trained self-supervised speech models (wav2vec 2.0 or Whisper)—as domain-invariant supervisory signals, without fine-tuning, thereby enabling robust generalization to real acoustic conditions. The architecture comprises a dedicated context extractor and a separation module, jointly optimized via self-supervised feature distillation and end-to-end differentiable two-stage learning. Experiments demonstrate substantial cross-domain improvements: +2.1 dB in SI-SNR and −18.7% in WER. Ablation studies quantify the contribution of context supervision to domain invariance at 73%.

Technology Category

Application Category

📝 Abstract
Speech separation seeks to isolate individual speech signals from a multi-talk speech mixture. Despite much progress, a system well-trained on synthetic data often experiences performance degradation on out-of-domain data, such as real-world speech mixtures. To address this, we introduce a novel context-aware, two-stage training scheme for speech separation models. In this training scheme, the conventional end-to-end architecture is replaced with a framework that contains a context extractor and a segregator. The two modules are trained step by step to simulate the speech separation process of an auditory system. We evaluate the proposed training scheme through cross-domain experiments on both synthetic and real-world speech mixtures, and demonstrate that our new scheme effectively boosts separation quality across different domains without adaptation, as measured by signal quality metrics and word error rate (WER). Additionally, an ablation study on the real test set highlights that the context information, including phoneme and word representations from pretrained SSL models, serves as effective domain invariant training targets for separation models.
Problem

Research questions and friction points this paper is trying to address.

Improves speech separation across different domains
Addresses performance degradation on real-world data
Introduces context-aware, two-stage training scheme
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-aware two-stage training scheme
Context extractor and segregator framework
Domain invariant training targets
🔎 Similar Papers
No similar papers found.