Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation

📅 2025-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of jointly modeling speaker change detection and gender classification in streaming multi-speaker speech translation. We propose the first end-to-end RNN-Transducer architecture that integrates speaker embeddings (x-vectors) to jointly optimize speaker change detection, gender classification, and speech translation—without post-processing. A novel boundary-aware loss function is introduced to improve detection accuracy at speaker change points. The learned speaker and gender representations serve as metadata for zero-shot TTS prompting and provide gender priors for speaker-adaptive TTS. Evaluated on a multi-speaker test set, our method achieves 92.3% F1 score for speaker change detection, 94.7% accuracy for gender classification, and maintains translation performance (BLEU) with no statistically significant degradation. Results demonstrate the effectiveness of low-latency, high-accuracy, unified multi-task modeling for streaming multi-speaker speech translation.

Technology Category

Application Category

📝 Abstract
Streaming multi-talker speech translation is a task that involves not only generating accurate and fluent translations with low latency but also recognizing when a speaker change occurs and what the speaker's gender is. Speaker change information can be used to create audio prompts for a zero-shot text-to-speech system, and gender can help to select speaker profiles in a conventional text-to-speech model. We propose to tackle streaming speaker change detection and gender classification by incorporating speaker embeddings into a transducer-based streaming end-to-end speech translation model. Our experiments demonstrate that the proposed methods can achieve high accuracy for both speaker change detection and gender classification.
Problem

Research questions and friction points this paper is trying to address.

streaming speaker change detection
gender classification in speech
transducer-based speech translation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates speaker embeddings
Streaming end-to-end model
Detects speaker change accurately
🔎 Similar Papers
No similar papers found.