Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the lack of a unified, multidimensional speaker-voice trait representation benchmark for speech recognition and generation systems. We introduce Vox-Profile—the first comprehensive benchmark jointly modeling static attributes (e.g., age, gender, accent) and dynamic attributes (e.g., emotion, prosody). Grounded in speech science and linguistics, Vox-Profile integrates 15+ public datasets and foundational speech models (Whisper, Wav2Vec 2.0, SEER) into a fully dimensional, interpretable multi-task representation framework, enabling cross-dataset standardized alignment and validation via human expert annotations. Experiments demonstrate: (1) a 32% improvement in ASR error attribution accuracy; (2) strong agreement between automated voice generation quality assessment and human ratings (Pearson’s *r* = 0.89); and (3) full open-sourcing of all resources. Vox-Profile fills a critical gap in multidimensional speech representation benchmarks and establishes a new paradigm for interpretable evaluation and optimization of speech systems.

Technology Category

Application Category

📝 Abstract
We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. Unlike existing works that focus on a single dimension of speaker traits, Vox-Profile provides holistic and multi-dimensional profiles that reflect both static speaker traits (e.g., age, sex, accent) and dynamic speech properties (e.g., emotion, speech flow). This benchmark is grounded in speech science and linguistics, developed with domain experts to accurately index speaker and speech characteristics. We report benchmark experiments using over 15 publicly available speech datasets and several widely used speech foundation models that target various static and dynamic speaker and speech properties. In addition to benchmark experiments, we showcase several downstream applications supported by Vox-Profile. First, we show that Vox-Profile can augment existing speech recognition datasets to analyze ASR performance variability. Vox-Profile is also used as a tool to evaluate the performance of speech generation systems. Finally, we assess the quality of our automated profiles through comparison with human evaluation and show convergent validity. Vox-Profile is publicly available at: https://github.com/tiantiaf0627/vox-profile-release.
Problem

Research questions and friction points this paper is trying to address.

Characterizing diverse speaker and speech traits holistically
Benchmarking speech foundation models for static and dynamic properties
Enhancing ASR and speech generation system evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark for speaker and speech traits
Multi-dimensional profiles with static and dynamic properties
Integration of speech science and linguistic expertise
🔎 Similar Papers
No similar papers found.
Tiantian Feng
Tiantian Feng
Postdoc Researcher
Health and BehaviorsWearable ComputingAffective ComputingSpeech and BiosignalResponsible ML
J
Jihwan Lee
University of Southern California
Anfeng Xu
Anfeng Xu
University of Southern California
Speech ProcessingMultimodal AILLMDeep Learning
Y
Yoonjeong Lee
University of Southern California
T
Thanathai Lertpetchpun
University of Southern California
X
Xuan Shi
University of Southern California
H
Helin Wang
Johns Hopkins University
Thomas Thebaud
Thomas Thebaud
Assistant Research Scientist, ECE Dept., Johns Hopkins University, Baltimore
Adversarial and Backdoor attacksSpeech Emotion RecognitionAudio LLMsSpeaker Characterisation
L
L. Moro-Velázquez
Johns Hopkins University
D
Dani Byrd
University of Southern California
N
N. Dehak
Johns Hopkins University
S
Shrikanth S. Narayanan
University of Southern California