🤖 AI Summary
Speaker diarization systems often erroneously split utterances from the same speaker into multiple clusters because of intra-speaker variability induced by emotion, health conditions, or speaking rate. To address this, we propose a data augmentation method based on style-controllable speech synthesis: a controllable text-to-speech model generates identity-preserving yet stylistically diverse speech samples, and their speaker embeddings are extracted and fused with those of the original utterances to learn style-invariant speaker representations. Our key innovation is the explicit integration of controllable speech generation into the diarization pipeline, enabling both diversity enhancement and consistency modeling in the embedding space. Evaluated on a simulated emotional speech dataset and a truncated AMI subset, our approach reduces diarization error rates by 49% and 35%, respectively, demonstrating substantially more robust clustering under high intra-speaker variability.
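For concreteness, below is a minimal sketch of the embedding-fusion step, assuming length-normalized embeddings and a simple weighted average; the function name, the weighting parameter `alpha`, and the averaging rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fuse_embeddings(orig_emb, synth_embs, alpha=0.5):
    """Blend a segment's original speaker embedding with embeddings
    from its style-augmented synthetic copies.

    orig_emb:   (D,)   embedding of the original diarized segment
    synth_embs: (K, D) embeddings of K synthesized, style-diverse copies
    alpha:      weight kept on the original embedding (hypothetical
                choice; the paper does not specify the blending rule)
    """
    # Length-normalize so the blend is driven by direction, not magnitude.
    orig = orig_emb / np.linalg.norm(orig_emb)
    synth = synth_embs / np.linalg.norm(synth_embs, axis=1, keepdims=True)
    # Average the synthetic variants first so they act as a single vote,
    # then mix with the original embedding.
    fused = alpha * orig + (1 - alpha) * synth.mean(axis=0)
    return fused / np.linalg.norm(fused)
```

Averaging the synthetic variants before blending keeps a single noisy style rendition from dominating the fused representation, which is one plausible way to realize the style-invariance the summary describes.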
📝 Abstract
Speaker diarization systems often struggle with high intrinsic intra-speaker variability, such as shifts in emotion, health, or content. This can cause segments from the same speaker to be misclassified as different individuals, for example, when a speaker raises their voice or speaks faster during a conversation. To address this, we propose a style-controllable speech generation model that augments speech across diverse styles while preserving the target speaker's identity. The proposed system starts with diarized segments from a conventional diarizer. For each diarized segment, it generates augmented speech samples enriched with phonetic and stylistic diversity. Speaker embeddings from the original and generated audio are then blended to improve the system's robustness when grouping segments with high intrinsic intra-speaker variability. We validate our approach on a simulated emotional speech dataset and the truncated AMI dataset, demonstrating significant improvements, with error rate reductions of 49% and 35% on the two datasets, respectively.
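The blended embeddings then feed a standard clustering backend to regroup the diarized segments. The sketch below uses plain agglomerative clustering over cosine distances as a stand-in; the distance threshold, linkage method, and the 192-dimensional embedding size (typical of ECAPA-TDNN speaker models) are assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def recluster(fused_embs, threshold=0.3):
    """Agglomerative re-clustering of diarized segments using the
    fused embeddings (cosine distance, average linkage)."""
    dists = pdist(fused_embs, metric="cosine")   # condensed distance matrix
    Z = linkage(dists, method="average")         # hierarchical merge tree
    # Cut the tree at the given cosine-distance threshold; segments in
    # the same cluster are attributed to the same speaker.
    return fcluster(Z, t=threshold, criterion="distance")

# Toy usage: 6 segments with random 192-dim "fused" embeddings.
rng = np.random.default_rng(0)
fused = rng.normal(size=(6, 192))
labels = recluster(fused)
print(labels)  # one cluster label per segment
```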