Cross-Talk Speech Reduction, by Separation, for Separation

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of training far-field speech separation models on real recordings, where the close-talk (near-field) microphone signals that could serve as supervision are often contaminated by strong cross-talk speech and background noise. To this end, the authors introduce the cross-talk reduction (CTR) task and propose CTRnet, a method trained directly on real-recorded pairs of close-talk and far-field mixtures to isolate each wearer's speech from the corresponding close-talk mixture. CTRnet's estimates are then used as pseudo-labels to train far-field speech separation models, an approach termed pseudo-label based far-field speech separation (PuLSS). Because both CTRnet and PuLSS can be trained on real-recorded data from the target domain, the framework addresses the generalization gap commonly observed when models are trained only on simulated data. On the CHiME-6 dataset, it achieves state-of-the-art automatic speech recognition (ASR) performance under both oracle and estimated speaker diarization, surpassing all CHiME-{7,8} challenge submissions. According to the authors, it is the first neural speech separation method to substantially outperform guided source separation on real conversational "speech-in-the-wild" data.

📝 Abstract

In conversational speech separation and recognition tasks, close-talk microphones are typically attached to each speaker during training data collection to capture near-field, close-talk mixture signals, in addition to using far-field microphones to record far-field mixture signals. Each such close-talk mixture exhibits a reasonably high energy level for the wearer and could intuitively serve as weak supervision for training far-field speech separation models directly on real-recorded far-field signals. However, they are not sufficiently clean for this purpose, as they often contain strong cross-talk speech from other speakers in addition to background noise. To address this, we propose cross-talk reduction (CTR), a task aiming to isolate the wearer's speech from each close-talk mixture, and a novel method called CTRnet, which can be trained directly on real-recorded pairs of close-talk and far-field mixtures to accomplish CTR. Building on CTRnet, we further propose pseudo-label based far-field speech separation (PuLSS), which uses CTRnet's estimated clean speech as pseudo-labels to train models for separating far-field mixtures. A key advantage of the proposed framework is that both CTRnet and PuLSS can be trained on real-recorded data from the target domain, addressing the generalization gap commonly observed when models are trained exclusively on simulated data. On the CHiME-6 dataset, our framework achieves state-of-the-art ASR performance under both oracle and estimated speaker diarization, surpassing all CHiME-{7,8} challenge submissions. To our knowledge, it is the first neural speech separation method that substantially outperforms guided source separation on real conversational "speech-in-the-wild" data.
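The data flow described in the abstract (close-talk mixtures are cleaned by a CTR model, and the cleaned signals become pseudo-labels for far-field separation) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `placeholder_ctr`, `build_pseudo_label_example`, and `pseudo_label_loss` are hypothetical names, the placeholder simply returns its input instead of running a trained CTRnet, and the negative SI-SDR loss is an assumption, since the abstract does not state which training objective is used.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio (dB) between two 1-D signals."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target part.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


def placeholder_ctr(close_talk_mixtures):
    """Hypothetical stand-in for a trained CTR model.

    A real model would suppress cross-talk and noise; this one returns
    each close-talk mixture unchanged so the sketch stays runnable.
    """
    return [m.copy() for m in close_talk_mixtures]


def build_pseudo_label_example(far_field_mixture, close_talk_mixtures, ctr_model=placeholder_ctr):
    """Pair a far-field mixture with one pseudo-label per speaker."""
    pseudo_labels = np.stack(ctr_model(close_talk_mixtures))
    return far_field_mixture, pseudo_labels


def pseudo_label_loss(separated, pseudo_labels):
    """Assumed objective: negative mean SI-SDR against the pseudo-labels."""
    return -np.mean([si_sdr(e, r) for e, r in zip(separated, pseudo_labels)])


# Toy two-speaker example with synthetic signals (random noise, not speech).
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(16000), rng.standard_normal(16000)
close_talk = [s1 + 0.1 * s2, s2 + 0.1 * s1]  # wearer dominant, weak cross-talk
far_field = 0.5 * s1 + 0.5 * s2 + 0.05 * rng.standard_normal(16000)

mixture, labels = build_pseudo_label_example(far_field, close_talk)
loss_perfect = pseudo_label_loss(labels, labels)
loss_unseparated = pseudo_label_loss(np.stack([mixture, mixture]), labels)
```

Because each close-talk microphone is tied to its wearer, the pseudo-labels in this sketch arrive in a known speaker order, so the loss compares outputs and labels position by position. Whether the actual system relies on that ordering is not stated in the abstract.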

Problem

Research questions and friction points this paper is trying to address.

Cross-Talk Speech

Speech Separation

Close-Talk Microphone

Far-Field Mixture

Weak Supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Talk Reduction

CTRnet

Pseudo-Label Speech Separation

Real-Recorded Data

Far-Field Speech Separation
