🤖 AI Summary
Current speaker similarity evaluation methods in text-to-speech—relying on automatic speaker verification (ASV) embeddings—overemphasize static acoustic attributes (e.g., timbre, pitch) while neglecting dynamic prosodic patterns such as rhythm, leading to inaccurate assessments of speaker identity consistency. This work systematically identifies and characterizes this bias, arguing that robust speaker similarity measurement requires joint modeling of static features and dynamic prosody. To address this, we propose U3D (Utterance-level Dynamic Rhythm Distance), a novel metric that combines dynamic time warping with rhythm feature extraction to quantify speaker-level rhythmic similarity. Extensive objective and subjective evaluations, along with controlled ablation studies, show that U3D substantially improves the ability to discriminate speaker identity consistency in voice cloning. We publicly release the U3D toolkit and implementation code, establishing a more reliable and interpretable benchmark for speaker similarity evaluation.
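The core idea behind U3D—aligning two utterances' rhythm trajectories with dynamic time warping (DTW) and scoring their residual mismatch—can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature choice (per-syllable durations), the local distance, and the length normalization are assumptions for the sake of the example, and the function names are hypothetical.

```python
# Hedged sketch of a DTW-based rhythm distance. Assumes each utterance's
# rhythm is already summarized as a 1-D sequence (e.g., per-syllable
# durations in seconds); U3D's actual features and normalization may differ.

def dtw_distance(a, b):
    """Classic DTW accumulated cost between two 1-D sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = min cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # step in a
                                 cost[i][j - 1],      # step in b
                                 cost[i - 1][j - 1])  # matched step
    return cost[n][m]

def rhythm_distance(durs_a, durs_b):
    """Length-normalized DTW cost; lower = more similar rhythm."""
    return dtw_distance(durs_a, durs_b) / (len(durs_a) + len(durs_b))
```

A pair of utterances with identical syllable timing yields a distance of 0, while a clone that compresses or stretches durations accumulates alignment cost; DTW tolerates different sequence lengths, which is why it suits comparing utterances of unequal duration.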
📝 Abstract
Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, which are designed to discriminate between speakers rather than to characterize identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.