DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition

📅 2024-12-30
🤖 AI Summary
To address poor generalization of target-speaker automatic speech recognition (ASR) in multi-speaker scenarios—particularly for unseen speakers—this paper proposes a speaker-agnostic conditional ASR framework that requires neither speaker embeddings nor speaker-specific training data. It is the first to explicitly condition Whisper and Branchformer models on binary speaker diarization labels. Key innovations include a frame-level diarization-dependent transformation (FDDT) and a query-key bias (QKb) mechanism, jointly enhancing speaker-aware acoustic modeling. Additionally, a CTC-Whisper hybrid decoding strategy is introduced to improve inference efficiency. Experiments on AMI, NOTSOFAR-1, Libri2Mix, and LibriCSS demonstrate substantial improvements in target-speaker ASR performance, strong generalization to unseen speakers, and no degradation in single-speaker ASR accuracy or robustness.

📝 Abstract
Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model's focus on target speakers while effectively handling overlapping speech. By leveraging diarization outputs as conditioning signals, DiCoW simplifies the workflow for multi-speaker ASR, improves generalization to unseen speakers, and enables more reliable transcription in real-world multi-speaker recordings. Additionally, we explore the integration of a connectionist temporal classification (CTC) head into Whisper and demonstrate its ability to improve transcription efficiency through hybrid decoding. Notably, we show that our approach is not limited to Whisper; it also provides similar benefits when applied to the Branchformer model. We validate DiCoW on real-world datasets, including AMI and NOTSOFAR-1 from the CHiME-8 challenge, as well as synthetic benchmarks such as Libri2Mix and LibriCSS, enabling direct comparisons with previous methods. Results demonstrate that DiCoW enhances the model's target-speaker ASR capabilities while maintaining Whisper's accuracy and robustness on single-speaker data.
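The frame-level diarization-dependent transformation described above can be sketched as a per-frame, class-conditional affine map applied to encoder hidden states. This is a minimal illustrative sketch, not the paper's implementation: the four-class frame labeling (silence, target, non-target, overlap) and the identity initialization are assumptions drawn from the summary.

```python
import numpy as np

# Assumed per-frame diarization classes: silence, target, non-target, overlap.
S, T, N, O = 0, 1, 2, 3

def fddt(hidden, labels, weights, biases):
    """Frame-level diarization-dependent transformation (sketch).

    Each frame's hidden vector is mapped by the affine transform of its
    diarization class:  h'_t = W[c_t] @ h_t + b[c_t].
    """
    out = np.empty_like(hidden)
    for t, c in enumerate(labels):
        out[t] = weights[c] @ hidden[t] + biases[c]
    return out

rng = np.random.default_rng(0)
d = 4                                     # toy hidden dimension
hidden = rng.normal(size=(6, d))          # 6 encoder frames
labels = np.array([S, T, T, O, N, S])     # toy diarization output
# One affine transform per class; identity/zero initialization leaves the
# pre-trained model's behaviour unchanged before fine-tuning.
weights = np.stack([np.eye(d)] * 4)
biases = np.zeros((4, d))
assert np.allclose(fddt(hidden, labels, weights, biases), hidden)
```

With identity-initialized transforms the conditioned model starts out equivalent to plain Whisper; fine-tuning then learns class-specific transforms that suppress non-target frames while preserving target speech.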
Problem

Research questions and friction points this paper is trying to address.

Speech Separation
Machine Recognition
Unseen Speakers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diarization-Conditioned Whisper
Speaker-independent recognition
Target speaker enhancement
Alexander Polok
Brno University of Technology, Faculty of Information Technology
Machine learning
Dominik Klement
Brno University of Technology
Automatic Speech Recognition · Speaker Diarization · Machine Learning
M. Kocour
Speech@FIT, Brno University of Technology, Czechia
Jiangyu Han
Speech@FIT, Brno University of Technology, Czechia
Federico Landini
Brno University of Technology
Bolaji Yusuf
Researcher, Brno University of Technology
Speech recognition · Spoken term detection
Matthew S. Wiesner
CLSP, Johns Hopkins University, USA; HLTCOE, Johns Hopkins University, USA
S. Khudanpur
CLSP, Johns Hopkins University, USA; HLTCOE, Johns Hopkins University, USA
Jan Cernocky
Brno University of Technology
Speech
Lukas Burget
Brno University of Technology
Speech processing