End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions

πŸ“… 2026-01-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the cascading error problem in speaker role differentiation and speech recognition within child–adult conversations by proposing the first end-to-end joint modeling framework that unifies automatic speech recognition (ASR) and speaker role classification on top of the Whisper architecture. The approach integrates serialized label outputs, a lightweight frame-level role discrimination head, silence suppression, and state-machine-guided constrained decoding to enforce structural consistency and precise timestamp alignment, ensuring semantically and structurally valid outputs. Experimental results demonstrate that the proposed method significantly outperforms conventional cascaded systems on two datasets, achieving lower multi-speaker word error rates and competitive speaker role classification accuracy on both Whisper-small and Whisper-large variants.
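The serialized label output described above interleaves speaker role tags, timestamps, and words into a single token stream that the decoder emits. A minimal sketch of how such a target sequence might be constructed is below; the tag spellings (`<|child|>`, `<|adult|>`) and two-decimal timestamp format are assumptions for illustration, not the paper's actual vocabulary.

```python
# Hypothetical serialization of diarized segments into one token stream,
# in the spirit of serialized output training. Tag formats are assumed.

def serialize(segments):
    """segments: list of (role, start_sec, end_sec, text), sorted by start time."""
    tokens = []
    for role, start, end, text in segments:
        tokens.append(f"<|{role}|>")        # speaker role tag
        tokens.append(f"<|{start:.2f}|>")   # segment start timestamp
        tokens.extend(text.split())         # the words themselves
        tokens.append(f"<|{end:.2f}|>")     # segment end timestamp
    return tokens

serialize([("adult", 0.0, 1.4, "how are you"), ("child", 1.6, 2.2, "good")])
# -> ['<|adult|>', '<|0.00|>', 'how', 'are', 'you', '<|1.40|>',
#     '<|child|>', '<|1.60|>', 'good', '<|2.20|>']
```

Training the decoder on such flattened sequences is what lets a single model produce speaker-attributed, time-stamped transcripts without a separate diarization stage.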

πŸ“ Abstract
Accurate transcription and speaker diarization of child-adult spoken interactions are crucial for developmental and clinical research. However, manual annotation is time-consuming and challenging to scale. Existing automated systems typically rely on cascaded speaker diarization and speech recognition pipelines, which can lead to error propagation. This paper presents a unified end-to-end framework that extends the Whisper encoder-decoder architecture to jointly model ASR and child-adult speaker role diarization. The proposed approach integrates: (i) a serialized output training scheme that emits speaker tags and start/end timestamps, (ii) a lightweight frame-level diarization head that enhances speaker-discriminative encoder representations, (iii) diarization-guided silence suppression for improved temporal precision, and (iv) a state-machine-based forced decoding procedure that guarantees structurally valid outputs. Comprehensive evaluations on two datasets demonstrate consistent and substantial improvements over two cascaded baselines, achieving lower multi-talker word error rates and competitive diarization accuracy across both Whisper-small and Whisper-large models. These findings highlight the effectiveness and practical utility of the proposed joint modeling framework for generating reliable, speaker-attributed transcripts of child-adult interactions at scale. The code and model weights are publicly available.
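Point (iv), state-machine-based forced decoding, constrains generation so that every emitted segment follows the pattern role tag, start timestamp, words, end timestamp. A minimal sketch of such a finite-state constraint is below; the token spellings and the exact state order are assumptions for illustration, not the paper's implementation, where the allowed-token mask would be applied at each decoding step.

```python
# Hypothetical finite-state constraint enforcing the serialized pattern:
# role tag -> start timestamp -> words -> end timestamp -> next role tag.
# Token spellings (<|child|>, <|adult|>, <|0.00|>) are assumptions.

ROLE_TOKENS = {"<|child|>", "<|adult|>"}

def is_timestamp(tok):
    # e.g. "<|1.20|>": strip the delimiters and check for a decimal number
    return (tok.startswith("<|") and tok.endswith("|>")
            and tok[2:-2].replace(".", "", 1).isdigit())

class SerializedOutputFSM:
    """Tracks which token kinds are structurally valid at each step."""
    def __init__(self):
        self.state = "expect_role"

    def allowed(self, token):
        if self.state == "expect_role":
            return token in ROLE_TOKENS
        if self.state == "expect_start":
            return is_timestamp(token)
        if self.state == "in_segment":  # words, or the closing end timestamp
            return is_timestamp(token) or not token.startswith("<|")
        return False

    def advance(self, token):
        assert self.allowed(token), f"{token!r} invalid in state {self.state}"
        if self.state == "expect_role":
            self.state = "expect_start"
        elif self.state == "expect_start":
            self.state = "in_segment"
        elif self.state == "in_segment" and is_timestamp(token):
            self.state = "expect_role"  # end timestamp closes the segment

def validate(tokens):
    fsm = SerializedOutputFSM()
    for t in tokens:
        fsm.advance(t)
    return fsm.state == "expect_role"  # only complete segments remain
```

During constrained decoding, `allowed` would mask out invalid logits before sampling each token, which is what guarantees the structurally valid outputs the abstract describes.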
Problem

Research questions and friction points this paper is trying to address.

speaker diarization
automatic speech recognition
child-adult interactions
end-to-end modeling
multi-talker transcription
Innovation

Methods, ideas, or system contributions that make the work stand out.

end-to-end joint modeling
speaker role diarization
Whisper architecture
serialized output training
diarization-guided silence suppression
πŸ”Ž Similar Papers
Anfeng Xu
University of Southern California
Speech Processing, Multimodal AI, LLM, Deep Learning
Tiantian Feng
Postdoc Researcher
Health and Behaviors, Wearable Computing, Affective Computing, Speech and Biosignal, Responsible ML
Somer L. Bishop
Weill Institute for Neurosciences, University of California, San Francisco, US
Catherine Lord
David Geffen School of Medicine, University of California, Los Angeles, US
Shrikanth S. Narayanan
Viterbi School of Engineering, University of Southern California, US