Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC

📅 2024-09-19
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of disentangling speaker-specific acoustic representations in multi-talker automatic speech recognition (MTASR), this paper proposes Speaker-Aware CTC (SACTC), a novel training objective. Building on a visualization analysis showing that CTC implicitly represents different speakers in distinct temporal regions of the acoustic embeddings, SACTC explicitly enforces selective activation of speaker-specific tokens over their corresponding speech frames within the Bayes risk CTC framework, achieving temporal acoustic disentanglement at the encoder level. The method integrates CTC, serialized output training (SOT), and temporally disentangled acoustic embedding modeling. Experiments on MTASR tasks show that SACTC achieves a 10% relative reduction in overall word error rate (WER) compared with the SOT-CTC baseline, rising to 15% on low-overlap speech. The implementation is publicly available.

📝 Abstract
Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and transcribing overlapping speech. To address these challenges, this paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for MTASR. Our visualization reveals that CTC guides the encoder to represent different speakers in distinct temporal regions of acoustic embeddings. Leveraging this insight, we propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework. SACTC is a CTC variant tailored for multi-talker scenarios: it explicitly models speaker disentanglement by constraining the encoder to represent different speakers' tokens at specific time frames. When integrated with SOT, the SOT-SACTC model consistently outperforms standard SOT-CTC across various degrees of speech overlap. Specifically, we observe relative word error rate reductions of 10% overall and 15% on low-overlap speech. This work represents an initial exploration of CTC-based enhancements for MTASR tasks, offering a new perspective on speaker disentanglement in multi-talker speech recognition. The code is available at https://github.com/kjw11/Speaker-Aware-CTC.
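As background to the abstract: SACTC is built on standard CTC (via the Bayes risk CTC framework), whose core computation is the forward (alpha) recursion over a blank-interleaved label sequence. The sketch below is a minimal, self-contained illustration of that standard recursion only, not the paper's SACTC objective; the function name and toy setup are illustrative, and the paper should be consulted for how SACTC additionally constrains which frames each speaker's tokens may occupy.

```python
import math


def ctc_log_likelihood(log_probs, target, blank=0):
    """Standard CTC forward (alpha) recursion in log space.

    log_probs: T x V nested list of per-frame log-probabilities.
    target: label sequence as a list of ints (no blanks).
    Returns log P(target | log_probs) summed over all alignments.
    """
    # Extended sequence with blanks interleaved: [a, b] -> [_, a, _, b, _]
    ext = [blank]
    for s in target:
        ext += [s, blank]
    S = len(ext)
    NEG_INF = float("-inf")

    def logsumexp(*xs):
        m = max(xs)
        if m == NEG_INF:
            return NEG_INF
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # Initialization: an alignment may start with a blank or the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]  # stay on the same extended symbol
            if s > 0:
                a = logsumexp(a, alpha[s - 1])  # advance by one
            # Skip a blank only between two *different* non-blank labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logsumexp(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or the trailing blank.
    return logsumexp(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)
```

For example, with two frames of uniform log-probabilities over a 3-symbol vocabulary, the alignments collapsing to the single label `1` are `(1,1)`, `(1,_)`, and `(_,1)`, so the total probability is 3/9 = 1/3. Speaker-aware variants such as SACTC modify which (frame, token) activations the objective rewards, rather than this recursion's overall structure.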
Problem

Research questions and friction points this paper is trying to address.

Multi-Talker Speech Recognition
Speaker Separation
Speech Identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speaker-aware CTC
Multi-talker ASR
Error rate reduction