NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of standardized evaluation for how well speech synthesis systems generate, place, and emphasize non-verbal vocalizations (such as laughter and sighs) while preserving naturalness. The authors propose NVBench, which they describe as the first bilingual (Mandarin–English) benchmark for non-verbal vocalizations. It comprises a taxonomy of 45 vocalization types, a curated dataset, and a multi-axis evaluation protocol that separates general speech quality from the controllability, placement accuracy, and salience of non-verbal cues. Using objective metrics, human listening tests, and an LLM-based multi-rater subjective evaluation, they assess 15 TTS systems. The findings indicate that controllability of non-verbal vocalizations often decouples from overall speech quality, and that oral cues with a low signal-to-noise ratio and long-duration affective vocalizations remain persistent bottlenecks.

📝 Abstract

Non-verbal vocalizations (NVVs) such as laughter, sighs, and sobs are essential for human-like speech, yet standardized evaluation remains limited: existing approaches do not jointly assess whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming the speech itself. We present the Non-verbal Vocalization Benchmark (NVBench), a bilingual (English/Chinese) benchmark that evaluates speech synthesis with NVVs. NVBench pairs a unified 45-type taxonomy with a curated bilingual dataset and introduces a multi-axis protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience. We benchmark 15 TTS systems using objective metrics, listening tests, and an LLM-based multi-rater evaluation. Results reveal that NVV controllability often decouples from quality, while low-SNR oral cues and long-duration affective NVVs remain persistent bottlenecks. NVBench enables fair comparison of systems with diverse control interfaces under a unified, standardized framework.
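
One way to make the "controllability decouples from quality" observation concrete is to keep the axes as separate per-system scores and measure how well the system ranking on one axis predicts the ranking on another. The sketch below is illustrative only and is not the paper's protocol: the `AxisScores` record, the 0-1 score scale, and the choice of Spearman rank correlation are all assumptions, and the two example systems are placeholders rather than reported results.

```python
from dataclasses import dataclass


@dataclass
class AxisScores:
    """Hypothetical per-system record; axis names follow the abstract."""
    system: str
    quality: float          # general speech naturalness and quality
    controllability: float  # was the intended NVV generated?
    placement: float        # was it placed where requested?
    salience: float         # is it clearly audible?


def average_ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one axis is constant")
    return cov / (vx * vy) ** 0.5


def axis_correlation(scores, axis_a, axis_b):
    """Rank correlation between two named axes across systems."""
    return spearman([getattr(s, axis_a) for s in scores],
                    [getattr(s, axis_b) for s in scores])


# Placeholder values for illustration only, not results from the paper.
toy = [
    AxisScores("system_a", quality=0.9, controllability=0.2, placement=0.3, salience=0.4),
    AxisScores("system_b", quality=0.6, controllability=0.8, placement=0.7, salience=0.6),
]
rho = axis_correlation(toy, "quality", "controllability")
```

Under these assumptions, a correlation near zero or below across the benchmarked systems would indicate that ranking systems by general quality says little about how reliably they produce the requested NVVs, which is why the axes are reported separately rather than folded into one score.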

Problem

Research questions and friction points this paper is trying to address.

non-verbal vocalizations

speech synthesis

benchmarking

controllability

salience

Innovation

Methods, ideas, or system contributions that make the work stand out.

non-verbal vocalizations

speech synthesis benchmark

multi-axis evaluation