Joint ASR and Speaker Role Tagging with Serialized Output Training

📅 2025-06-12
🤖 AI Summary
To address the lack of speaker-role awareness in automatic speech recognition (ASR) systems for conversational speech, this paper investigates joint ASR and speaker role tagging with Serialized Output Training (SOT). Whisper is augmented with role-specific tokens and fine-tuned with SOT, so a single decoding pass produces a transcription with explicit role labels rather than relying on a separate role-tagging stage. The SOT approach is compared against a previous self-supervised baseline on two real-world conversational datasets. It achieves a reduction of more than 10% in multi-talker word error rate (WER), which supports its feasibility as a unified model for speaker-role-aware transcription.

📝 Abstract
Automatic speech recognition (ASR) systems have made significant progress with large-scale pre-trained models. However, most current systems focus solely on transcribing the speech without identifying speaker roles, a function that is critical for conversational AI. In this work, we investigate the use of serialized output training (SOT) for joint ASR and speaker role tagging. By augmenting Whisper with role-specific tokens and fine-tuning it with SOT, we enable the model to generate role-aware transcriptions in a single decoding pass. We compare the SOT approach against a previous self-supervised baseline method on two real-world conversational datasets. Our findings show that this approach achieves a reduction of more than 10% in multi-talker WER, demonstrating its feasibility as a unified model for speaker-role-aware speech transcription.
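As a rough illustration of what a serialized, role-tagged training target might look like, the sketch below joins the utterances of a conversation into one string, each prefixed with a role token. The token strings `<|agent|>` and `<|customer|>`, the `Utterance` type, and the ordering by start time are assumptions made for this example; the abstract does not specify the actual tokens or serialization order.

```python
from dataclasses import dataclass

# Hypothetical role tokens; the paper's actual token strings are not stated here.
ROLE_TOKENS = {"agent": "<|agent|>", "customer": "<|customer|>"}


@dataclass
class Utterance:
    role: str     # key into ROLE_TOKENS
    start: float  # utterance start time in seconds
    text: str     # reference transcription


def serialize(utterances):
    """Build one target string: utterances sorted by start time (an assumed
    ordering), each prefixed with the token for its speaker role."""
    ordered = sorted(utterances, key=lambda u: u.start)
    return " ".join(f"{ROLE_TOKENS[u.role]} {u.text.strip()}" for u in ordered)


if __name__ == "__main__":
    conversation = [
        Utterance("customer", 1.2, "hi there"),
        Utterance("agent", 0.0, "hello"),
    ]
    print(serialize(conversation))
```

A string of this form would serve as the decoder target during fine-tuning, so the model learns to emit the role token and the words in the same output sequence.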
Problem

Research questions and friction points this paper is trying to address.

Joint ASR and speaker role tagging in conversations
Single-pass decoding for role-aware transcriptions
Reducing multi-talker WER by more than 10%
Innovation

Methods, ideas, or system contributions that make the work stand out.

Serialized Output Training for joint ASR and speaker role tagging
Role-specific tokens augment Whisper model
Single-pass decoding for role-aware transcription
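
To make the single-pass idea concrete, the following sketch shows one way a decoded string containing role tokens could be split back into (role, text) segments. The token strings and the `split_by_role` helper are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import re

# Hypothetical role tokens; not the paper's actual vocabulary.
ROLE_TOKENS = ("<|agent|>", "<|customer|>")
_ROLE_PATTERN = re.compile("(" + "|".join(re.escape(t) for t in ROLE_TOKENS) + ")")


def split_by_role(decoded):
    """Split a role-tagged transcription into (role, text) pairs.

    Text that appears before any role token is kept with role None.
    """
    segments = []
    role = None
    # The capturing group makes re.split keep the role tokens in the result.
    for part in _ROLE_PATTERN.split(decoded):
        if part in ROLE_TOKENS:
            role = part.strip("<|>")  # "<|agent|>" becomes "agent"
        elif part.strip():
            segments.append((role, part.strip()))
    return segments


if __name__ == "__main__":
    print(split_by_role("<|agent|> hello <|customer|> hi there"))
```

Because the role labels are part of the decoded text, attributing words to roles needs only this kind of string post-processing rather than a second model.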