🤖 AI Summary
Traditional ASR and TTS systems typically neglect paralinguistic vocalizations—such as laughter, breathing, and filled pauses (“uh”, “oh”)—despite their critical role in emotional expression and conversational interaction. To address this, we propose the first end-to-end paralinguistic-aware Mandarin speech modeling framework: (1) We formulate paralinguistic information as learnable, decodable head tokens, enabling unified paralinguistic recognition and controllable synthesis; (2) We construct the first large-scale word-level annotated Chinese paralinguistic speech dataset—comprising 48k manually and 174k automatically labeled utterances (573 hours total); (3) Leveraging this dataset, we train a paralinguistic-aware ASR system and fine-tune a zero-shot TTS model to generate context-aware paralinguistic vocalizations. Experiments demonstrate substantial improvements in speech naturalness, expressiveness, and controllability over baseline systems.
📝 Abstract
Paralinguistic vocalizations, including non-verbal sounds like laughter and breathing as well as lexicalized interjections such as "uhm" and "oh", are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional cues, they remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances covering 18 word-level paralinguistic categories. (2) We develop a paralinguistic-aware ASR model that treats paralinguistic cues as inline decodable tokens (e.g., "You're so funny [Laughter]"), enabling joint lexical and non-verbal transcription. This model is then used to automatically annotate a larger corpus, yielding the first large-scale Chinese dataset of 174,179 utterances (573 hours) with word-level alignment and paralinguistic cues. (3) We fine-tune zero-shot TTS models on both the human- and auto-labeled data to enable explicit control over paralinguistic vocalizations, allowing context-aware insertion at arbitrary token positions for human-like speech synthesis. By unifying the recognition and generation of paralinguistic vocalizations, NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin. Dataset and audio demos are available at https://nvspeech170k.github.io/.
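The inline-token idea above can be sketched in a few lines: paralinguistic cues appear as bracketed tokens interleaved with lexical words, so recognition output can be parsed and TTS input can be controlled by inserting a tag at any token position. The tag names and helper functions below are illustrative assumptions, not the paper's actual inventory or implementation (the dataset defines 18 categories).

```python
import re

# Illustrative subset of bracketed paralinguistic tags; the NVSpeech
# dataset defines 18 word-level categories (not listed here).
TAG_PATTERN = re.compile(r"\[[A-Za-z]+\]")

def split_transcript(transcript: str):
    """Split an inline-tagged transcript into (token, kind) pairs,
    distinguishing lexical words from paralinguistic tokens."""
    tokens = []
    for tok in transcript.split():
        kind = "paralinguistic" if TAG_PATTERN.fullmatch(tok) else "lexical"
        tokens.append((tok, kind))
    return tokens

def insert_tag(transcript: str, tag: str, position: int) -> str:
    """Insert a paralinguistic tag at an arbitrary token position,
    mimicking the controllable-TTS input format described above."""
    words = transcript.split()
    words.insert(position, tag)
    return " ".join(words)
```

For example, `insert_tag("You're so funny", "[Laughter]", 3)` produces the tagged string shown in the abstract, and `split_transcript` recovers which tokens are non-verbal cues.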