🤖 AI Summary
In multi-talker scenarios, front-end speaker separation introduces distortions that severely degrade downstream ASR performance, especially when the ASR backend is trained exclusively on clean speech. To address this, we propose a decoupled training paradigm: a front-end deep separation model (e.g., a TasNet variant) is trained independently, while the back-end ASR model (e.g., Conformer) is trained solely on clean speech, eliminating interference from separation artifacts. This work is the first to systematically realize full training decoupling between the separation and recognition modules, breaking away from conventional joint-optimization and noise-adaptation frameworks. Evaluated on Libri2Mix, SMS-WSJ (single-/six-channel), and LibriCSS, our approach achieves WERs of 5.1%, 7.60%/5.74%, and 2.92%, respectively, setting new state-of-the-art results at the time and demonstrating significant gains in both robustness and accuracy.
📝 Abstract
Despite the tremendous success of automatic speech recognition (ASR) since the introduction of deep learning, its performance is still unsatisfactory in many real-world multi-talker scenarios. Speaker separation excels at separating individual talkers, but as a frontend it introduces processing artifacts that degrade an ASR backend trained on clean speech. As a result, mainstream robust ASR systems train the backend on noisy speech to tolerate such artifacts. In this work, we propose to decouple the training of the speaker separation frontend and the ASR backend, with the latter trained on clean speech only. Our decoupled system achieves a 5.1% word error rate (WER) on the Libri2Mix dev/test sets, significantly outperforming other multi-talker ASR baselines. Its effectiveness is further demonstrated by state-of-the-art WERs of 7.60% and 5.74% on single- and six-channel SMS-WSJ. Furthermore, on recorded LibriCSS, we achieve a speaker-attributed WER of 2.92%. These state-of-the-art results suggest that decoupling speaker separation and recognition is an effective approach to robust multi-talker ASR.
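The decoupled pipeline described above (frontend and backend trained independently, backend on clean speech only, composed at inference) can be sketched with a deliberately tiny toy example. The stand-ins below are assumptions for illustration, not the paper's models: a linear least-squares demixer replaces the TasNet-style separator, and a correlation-based template classifier replaces the Conformer ASR backend.

```python
# Toy sketch of the decoupled paradigm: the separation frontend and the
# recognition backend are trained independently, and the backend never
# sees separated (artifact-bearing) audio during training.
import numpy as np

t = np.linspace(0, 1, 400, endpoint=False)

# Two "speakers": each utters one of two tones ("words" 0 and 1).
def utterance(word, phase):
    freq = 5.0 if word == 0 else 9.0
    return np.sin(2 * np.pi * freq * t + phase)

# --- Backend training: clean speech ONLY ------------------------------
# A template "recognizer": one clean prototype per word.
templates = np.stack([utterance(0, 0.0), utterance(1, 0.0)])

def recognize(signal):
    # Pick the word whose clean template best correlates with the input.
    scores = [abs(np.dot(signal, tmpl)) for tmpl in templates]
    return int(np.argmax(scores))

# --- Frontend training: separation, independent of the backend --------
# Two-channel instantaneous mixtures x = A @ s; learn a demixing matrix W
# from paired (mixture, clean-source) data by least squares.
A = np.array([[1.0, 0.6], [0.5, 1.0]])                      # unknown mixing
S_train = np.stack([utterance(0, 0.3), utterance(1, 1.1)])  # clean sources
X_train = A @ S_train                                       # observed mixtures
W = S_train @ np.linalg.pinv(X_train)                       # demixing estimate

# --- Inference: separate, then recognize ------------------------------
s1, s2 = utterance(1, 0.7), utterance(0, 2.0)   # test: words 1 and 0
X_test = A @ np.stack([s1, s2])
S_hat = W @ X_test                              # separated streams
print([recognize(s) for s in S_hat])            # expected: [1, 0]
```

Note how the decoupling shows up in the code: `recognize` is fit on `templates` built from clean utterances alone, with no exposure to `S_hat`, yet it still classifies the separated streams correctly; in the paper, making this composition work for deep models under real separation artifacts is the hard part.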