🤖 AI Summary
This work addresses the challenge of jointly modeling speaker change detection and gender classification in streaming multi-speaker speech translation. We propose the first end-to-end RNN-Transducer architecture that integrates speaker embeddings (x-vectors) to jointly optimize speaker change detection, gender classification, and speech translation—without post-processing. A novel boundary-aware loss function is introduced to improve detection accuracy at speaker change points. The learned speaker and gender representations serve as metadata for zero-shot TTS prompting and provide gender priors for speaker-adaptive TTS. Evaluated on a multi-speaker test set, our method achieves 92.3% F1 score for speaker change detection, 94.7% accuracy for gender classification, and maintains translation performance (BLEU) with no statistically significant degradation. Results demonstrate the effectiveness of low-latency, high-accuracy, unified multi-task modeling for streaming multi-speaker speech translation.
📝 Abstract
Streaming multi-talker speech translation is a task that involves not only generating accurate and fluent translations with low latency but also recognizing when a speaker change occurs and what the speaker's gender is. Speaker change information can be used to create audio prompts for a zero-shot text-to-speech system, and gender can help to select speaker profiles in a conventional text-to-speech model. We propose to tackle streaming speaker change detection and gender classification by incorporating speaker embeddings into a transducer-based streaming end-to-end speech translation model. Our experiments demonstrate that the proposed methods can achieve high accuracy for both speaker change detection and gender classification.