NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation

📅 2026-03-16
🤖 AI Summary
This work addresses the lack of standardized evaluation methods and reliable reference audio for nonverbal vocalization (NV) synthesis in current text-to-speech systems. Treating NVs as communicative acts, the study proposes the first benchmark for NV synthesis grounded in a functional taxonomy. It comprises 1,651 multilingual, in-the-wild utterances with paired human reference recordings, balanced across 14 NV categories. The authors introduce a two-dimensional evaluation protocol that measures both instruction alignment and acoustic fidelity. For instruction alignment they propose a new objective metric, the Paralinguistic Character Error Rate (PCER); for acoustic fidelity they measure the distributional gap to real recordings. Together these enable automated assessment that correlates strongly with human subjective judgments, giving NV synthesis a standardized evaluation framework.

📝 Abstract
While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multilingual, in-the-wild utterances with paired human reference audio, balanced across 14 NV categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, which uses the proposed paralinguistic character error rate (PCER) to assess controllability, and (2) Acoustic Fidelity, which measures the distributional gap to real recordings to assess acoustic realism. We evaluate diverse TTS models and develop two baselines. Experimental results demonstrate a strong correlation between our objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework.
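
The abstract names PCER but does not define it. As a rough illustration only, the sketch below assumes PCER behaves like a conventional character error rate: the Levenshtein edit distance between a reference sequence and a hypothesis sequence, divided by the reference length. The function names, the token-level granularity, and the tag spellings such as "[laugh]" are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions, and substitutions each cost 1.
    """
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by reference length (error-rate style).

    Assumption: PCER follows this general form; the paper may differ.
    """
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: the synthesized audio drops the requested laugh.
reference = ["[laugh]", "that", "was", "great", "[sigh]"]
hypothesis = ["that", "was", "great", "[sigh]"]
print(error_rate(reference, hypothesis))  # 0.2
```

In this example one of five reference tokens is missing from the hypothesis, so the rate is 1/5. How the hypothesis sequence is obtained from synthesized audio (for instance, by a recognizer that transcribes NV tags) is outside the scope of this sketch.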

Problem

Research questions and friction points this paper is trying to address.

nonverbal vocalization
text-to-speech
evaluation benchmark
expressive speech synthesis
paralinguistic communication

Innovation

Methods, ideas, or system contributions that make the work stand out.

Nonverbal Vocalization
Text-to-Speech Synthesis
Benchmark
Paralinguistic Character Error Rate
Expressive Speech

Qinke Ni
The Chinese University of Hong Kong, Shenzhen
Huan Liao
The Chinese University of Hong Kong, Shenzhen
Speech Synthesis, Audio generation
Dekun Chen
The Chinese University of Hong Kong, Shenzhen
Yuxiang Wang
The Chinese University of Hong Kong, Shenzhen
Zhizheng Wu
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Mel Lab
Spoken Language Processing, DeepFake detection, Music Processing