🤖 AI Summary
Traditional ASR and TTS systems typically neglect paralinguistic vocalizations—such as laughter, breathing, and filled pauses (“uh”, “oh”)—despite their critical role in emotional expression and conversational interaction. To address this, we propose the first end-to-end paralinguistic-aware Mandarin speech modeling framework: (1) We formulate paralinguistic information as learnable, decodable head tokens, enabling unified paralinguistic recognition and controllable synthesis; (2) We construct the first large-scale word-level annotated Chinese paralinguistic speech dataset—comprising 48k manually and 174k automatically labeled utterances (573 hours total); (3) Leveraging this dataset, we train a paralinguistic-aware ASR system and fine-tune a zero-shot TTS model to generate context-aware paralinguistic vocalizations. Experiments demonstrate substantial improvements in speech naturalness, expressiveness, and controllability over baseline systems.
📝 Abstract
Paralinguistic vocalizations, including non-verbal sounds like laughter and breathing as well as lexicalized interjections such as "uhm" and "oh", are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional cues, they remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances covering 18 word-level paralinguistic categories. (2) We develop a paralinguistic-aware ASR model that treats paralinguistic cues as inline decodable tokens (e.g., "You're so funny [Laughter]"), enabling joint lexical and non-verbal transcription. This model is then used to automatically annotate a larger corpus, yielding the first large-scale Chinese dataset of 174,179 utterances (573 hours) with word-level alignment and paralinguistic cues. (3) We fine-tune zero-shot TTS models on both the human- and auto-labeled data to enable explicit control over paralinguistic vocalizations, allowing context-aware insertion at arbitrary token positions for human-like speech synthesis. By unifying the recognition and generation of paralinguistic vocalizations, NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin. Dataset and audio demos are available at https://nvspeech170k.github.io/.
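The inline-token idea above can be sketched in a few lines: paralinguistic cues appear as bracketed tokens interleaved with lexical words, so recognition output can be parsed and TTS input can be controlled by inserting a tag at any token position. The tag names and helper functions below are illustrative assumptions, not the paper's actual inventory or implementation (the dataset defines 18 categories).

```python
import re

# Illustrative subset of bracketed paralinguistic tags; the NVSpeech
# dataset defines 18 word-level categories (not listed here).
TAG_PATTERN = re.compile(r"\[[A-Za-z]+\]")

def split_transcript(transcript: str):
    """Split an inline-tagged transcript into (token, kind) pairs,
    distinguishing lexical words from paralinguistic tokens."""
    tokens = []
    for tok in transcript.split():
        kind = "paralinguistic" if TAG_PATTERN.fullmatch(tok) else "lexical"
        tokens.append((tok, kind))
    return tokens

def insert_tag(transcript: str, tag: str, position: int) -> str:
    """Insert a paralinguistic tag at an arbitrary token position,
    mimicking the controllable-TTS input format described above."""
    words = transcript.split()
    words.insert(position, tag)
    return " ".join(words)
```

For example, `insert_tag("You're so funny", "[Laughter]", 3)` produces the tagged string shown in the abstract, and `split_transcript` recovers which tokens are non-verbal cues.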