Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speaker diarization systems often erroneously split utterances from the same speaker into multiple clusters due to intra-speaker variability induced by emotion, health conditions, or speaking rate. To address this, we propose a style-controllable speech synthesis–based data augmentation method: leveraging a controllable text-to-speech model to generate identity-preserving yet stylistically diverse speech samples; extracting and fusing their speaker embeddings with those of the original utterances to learn style-invariant speaker representations. Our key innovation lies in explicitly integrating controllable speech generation into the diarization pipeline—enabling both diversity enhancement and consistency modeling within the embedding space. Evaluated on a simulated emotional speech dataset and a truncated AMI subset, our approach reduces diarization error rates by 49% and 35%, respectively, demonstrating substantial improvements in clustering robustness under high intra-speaker variability.

📝 Abstract
Speaker diarization systems often struggle with high intrinsic intra-speaker variability, such as shifts in emotion, health, or content. This can cause segments from the same speaker to be misclassified as different individuals, for example, when a speaker raises their voice or speaks faster during a conversation. To address this, we propose a style-controllable speech generation model that augments speech across diverse styles while preserving the target speaker's identity. The proposed system starts with diarized segments from a conventional diarizer. For each diarized segment, it generates augmented speech samples enriched with phonetic and stylistic diversity. Speaker embeddings from the original and generated audio are then blended to improve the system's robustness in grouping segments with high intrinsic intra-speaker variability. We validate our approach on a simulated emotional speech dataset and the truncated AMI dataset, demonstrating significant improvements, with error rate reductions of 49% and 35% on the two datasets, respectively.
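The embedding-blending step described above can be sketched as follows. The paper does not specify the fusion rule, so this assumes a simple convex interpolation between a segment's original speaker embedding and the mean embedding of its style-augmented copies; the function name and the weight `alpha` are hypothetical.

```python
import numpy as np

def blend_embeddings(original_emb, augmented_embs, alpha=0.5):
    """Fuse a segment's original speaker embedding with the mean
    embedding of its style-augmented copies.

    alpha weights the original embedding; convex interpolation is an
    assumption here, not the paper's stated fusion rule.
    """
    aug_mean = np.mean(augmented_embs, axis=0)
    blended = alpha * np.asarray(original_emb) + (1.0 - alpha) * aug_mean
    # L2-normalize so downstream cosine-similarity clustering is well-behaved
    return blended / np.linalg.norm(blended)

# Toy example: one original embedding and two augmented-style embeddings
orig = np.array([1.0, 0.0, 0.0])
augs = [np.array([0.8, 0.2, 0.0]), np.array([0.9, 0.0, 0.1])]
emb = blend_embeddings(orig, augs)
```

The blended, unit-norm embedding would then replace the original segment embedding before clustering, which is where the robustness gain against stylistic variation is claimed to arise.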
Problem

Research questions and friction points this paper is trying to address.

Reducing intra-speaker variability in diarization systems
Preventing misclassification of same speaker segments
Enhancing robustness against emotional and stylistic variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Style-controllable speech generation model
Augments speech across diverse styles
Blends speaker embeddings from original and generated audio
Miseul Kim
Department of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea
Soo Jin Park
Qualcomm Technologies, Inc., San Diego, California, USA
Kyungguen Byun
Qualcomm Inc.
text-to-speech, voice conversion
Hyeon-Kyeong Shin
Qualcomm Technologies, Inc., San Diego, California, USA
Sunkuk Moon
Qualcomm Technologies, Inc., San Diego, California, USA
Shuhua Zhang
Tsinghua University
audio, speech, acoustics, digital signal processing, telecommunications
Erik Visser
R&D
Machine learning - AI, Signal processing, Automatic Control, Chemometrics