🤖 AI Summary
Speech separation models trained on synthetic data suffer significant performance degradation on real-world mixed speech. To address this domain gap, we propose a context-aware two-stage decoupled training framework that disentangles context extraction from source separation, inspired by human auditory processing. Our method leverages phoneme- or word-level contextual representations—extracted from pre-trained self-supervised speech models (wav2vec 2.0 or Whisper)—as domain-invariant supervisory signals, without fine-tuning, thereby enabling robust generalization to real acoustic conditions. The architecture comprises a dedicated context extractor and a separation module, jointly optimized via self-supervised feature distillation and end-to-end differentiable two-stage learning. Experiments demonstrate substantial cross-domain improvements: +2.1 dB in SI-SNR and −18.7% in WER. Ablation studies quantify the contribution of context supervision to domain invariance at 73%.
📝 Abstract
Speech separation seeks to isolate individual speech signals from a multi-talk speech mixture. Despite much progress, a system well-trained on synthetic data often experiences performance degradation on out-of-domain data, such as real-world speech mixtures. To address this, we introduce a novel context-aware, two-stage training scheme for speech separation models. In this training scheme, the conventional end-to-end architecture is replaced with a framework that contains a context extractor and a segregator. The two modules are trained step by step to simulate the speech separation process of an auditory system. We evaluate the proposed training scheme through cross-domain experiments on both synthetic and real-world speech mixtures, and demonstrate that our new scheme effectively boosts separation quality across different domains without adaptation, as measured by signal quality metrics and word error rate (WER). Additionally, an ablation study on the real test set highlights that the context information, including phoneme and word representations from pretrained SSL models, serves as effective domain invariant training targets for separation models.