A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech

📅 2026-04-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses speaker drift, a gradual shift in speaker identity within a single utterance that degrades the coherence of diffusion-based text-to-speech synthesis. We propose the first automatic framework for detecting such inconsistencies, formulating the problem as a binary classification task over utterance-level speaker consistency. Our approach computes cosine similarities between speaker embeddings of overlapping audio segments and passes structured representations of those similarities to large language models, which judge whether drift has occurred. We show that speaker embeddings cluster meaningfully on the unit sphere and provide theoretical guarantees for cosine similarity-based detection. A synthetic benchmark with human-validated drift annotations supports evaluation across multiple state-of-the-art large language models. Together, these results establish speaker drift as a standalone research problem.

📝 Abstract

Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of synthetic speech, especially in long-form or interactive settings. We introduce the first automatic framework for detecting speaker drift by formulating it as a binary classification task over utterance-level speaker consistency. Our method computes cosine similarity across overlapping segments of synthesized speech and prompts large language models (LLMs) with structured representations to assess drift. We provide theoretical guarantees for cosine-based drift detection and demonstrate that speaker embeddings exhibit meaningful geometric clustering on the unit sphere. To support evaluation, we construct a high-quality synthetic benchmark with human-validated speaker drift annotations. Experiments with multiple state-of-the-art LLMs confirm the viability of this embedding-to-reasoning pipeline. Our work establishes speaker drift as a standalone research problem and bridges geometric signal analysis with LLM-based perceptual reasoning in modern TTS.
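The segment-level similarity step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the window and hop sizes, the function names, and the text layout handed to the LLM are all assumptions made for illustration, and the speaker embeddings are assumed to come from some external speaker encoder that is not shown here.

```python
import numpy as np


def overlapping_segments(num_samples, win, hop):
    """Return (start, end) sample indices of overlapping windows.

    `win` and `hop` are hypothetical parameters; hop < win gives overlap.
    """
    if num_samples < win:
        return [(0, num_samples)]
    return [(s, s + win) for s in range(0, num_samples - win + 1, hop)]


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between per-segment speaker embeddings.

    Rows are projected onto the unit sphere first, so the matrix product
    of the normalized rows is exactly the cosine similarity.
    """
    e = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(e, axis=1, keepdims=True)
    unit = e / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def adjacent_similarities(sim):
    """Similarity between each segment and the next one (first off-diagonal)."""
    return np.diagonal(sim, offset=1)


def format_for_llm(sim, decimals=2):
    """Render the similarity matrix as plain text rows.

    This layout is an assumed stand-in for the paper's structured
    representation, which the abstract does not specify.
    """
    return "\n".join(
        " ".join(f"{v:.{decimals}f}" for v in row) for row in sim
    )


if __name__ == "__main__":
    # Toy embeddings: the third segment points in a different direction,
    # mimicking a change in perceived speaker identity.
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    sim = cosine_similarity_matrix(toy)
    print(overlapping_segments(10, 4, 2))
    print(adjacent_similarities(sim))
    print(format_for_llm(sim))
```

A sharp fall in the adjacent-segment similarities is the kind of pattern an LLM judge could be prompted to look for. The abstract does not describe the prompt or decision rule, so that part is left out of the sketch.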

Problem

Research questions and friction points this paper is trying to address.

speaker drift

synthesized speech

text-to-speech

speaker consistency

voice identity

Innovation

Methods, ideas, or system contributions that make the work stand out.

speaker drift detection

diffusion-based TTS

cosine similarity