MAD Speech: Measures of Acoustic Diversity of Speech

📅 2024-04-16
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
Existing generative speech models lack quantifiable, multi-dimensional metrics for evaluating acoustic diversity. Method: We propose MAD Speech, a set of lightweight metrics covering five facets of acoustic diversity: voice (speaker identity), gender, emotion, accent, and background noise. Each metric composes a specialized per-facet embedding model with an aggregation function (e.g., mean pairwise distance) that quantifies diversity in the embedding space. To validate the metrics, we construct a series of datasets with a priori known diversity preferences for each facet. Contribution/Results: Experiments show that MAD Speech agrees with ground-truth diversity more strongly than baseline metrics, and we demonstrate its applicability across several real-life evaluation scenarios. MAD Speech will be made publicly accessible.

📝 Abstract
Generative spoken language models produce speech in a wide range of voices, prosody, and recording conditions, seemingly approaching the diversity of natural speech. However, the extent to which generated speech is acoustically diverse remains unclear due to a lack of appropriate metrics. We address this gap by developing lightweight metrics of acoustic diversity, which we collectively refer to as MAD Speech. We focus on measuring five facets of acoustic diversity: voice, gender, emotion, accent, and background noise. We construct the metrics as a composition of specialized, per-facet embedding models and an aggregation function that measures diversity within the embedding space. Next, we build a series of datasets with a priori known diversity preferences for each facet. Using these datasets, we demonstrate that our proposed metrics achieve a stronger agreement with the ground-truth diversity than baselines. Finally, we showcase the applicability of our proposed metrics across several real-life evaluation scenarios. MAD Speech will be made publicly accessible.
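The abstract describes the metrics as a composition of per-facet embedding models and an aggregation function over the embedding space. A minimal sketch of one such aggregation, mean pairwise cosine distance (named in the summary as an example; the paper's exact choices may differ), applied to precomputed embeddings:

```python
import numpy as np

def mean_pairwise_distance(embeddings: np.ndarray) -> float:
    """Diversity of a sample set as the mean pairwise cosine distance
    between embeddings (one plausible aggregation function)."""
    # L2-normalize rows so the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T  # pairwise cosine similarities
    # Average over the n*(n-1)/2 distinct pairs, excluding self-similarity.
    iu = np.triu_indices(len(embeddings), k=1)
    return float(np.mean(1.0 - sims[iu]))

# Identical embeddings yield zero diversity; spread-out embeddings score higher.
same = np.tile([1.0, 0.0], (4, 1))
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
print(mean_pairwise_distance(same))   # 0.0
print(mean_pairwise_distance(mixed))  # 4/3
```

In practice the embeddings would come from a specialized model per facet (e.g., a speaker-verification model for voice); the 2-D vectors here are only illustrative.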
Problem

Research questions and friction points this paper is trying to address.

Generated speech spans many voices, prosodic styles, and recording conditions, yet its actual acoustic diversity is unclear.
No appropriate metrics exist to quantify acoustic diversity, which is multi-faceted: voice, gender, emotion, accent, and background noise.
Any proposed metric must be validated against datasets with known diversity preferences.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed MAD Speech, a set of lightweight metrics of acoustic diversity
Composed specialized per-facet embedding models with aggregation functions over the embedding space
Built validation datasets with a priori known diversity preferences for each facet
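The validation idea above can be sketched as follows: construct sample sets whose relative diversity is known in advance (here, a single synthetic embedding cluster standing in for one speaker versus several clusters standing in for many speakers), then check that the metric reproduces the known ordering. The data and the specific aggregation function are illustrative assumptions, not the paper's actual datasets or models:

```python
import numpy as np

def diversity(embeddings: np.ndarray) -> float:
    # Mean pairwise cosine distance (one plausible aggregation function).
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    iu = np.triu_indices(len(embeddings), k=1)
    return float(np.mean(1.0 - (normed @ normed.T)[iu]))

rng = np.random.default_rng(0)

# "Low diversity": all samples near one embedding cluster (e.g., one speaker).
low = rng.normal(loc=[5.0, 0.0], scale=0.1, size=(32, 2))

# "High diversity": samples spread over four clusters (e.g., many speakers).
centers = np.array([[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0], [0.0, -5.0]])
high = np.concatenate([rng.normal(loc=c, scale=0.1, size=(8, 2)) for c in centers])

# A useful diversity metric must rank the known-more-diverse set higher.
assert diversity(high) > diversity(low)
```

Agreement between a metric's ranking and the a priori ordering, aggregated over many such set pairs, gives the kind of ground-truth comparison against baselines that the paper reports.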