🤖 AI Summary
Existing approaches to paralinguistic speech (e.g., laughter, sighs) typically rely on proprietary resources, while publicly available datasets suffer from incomplete utterances, missing or inaccurate timestamps, and limited real-world relevance, hindering progress in both natural speech synthesis and paralinguistic understanding. To address these limitations, the authors propose the first fully automated framework for constructing large-scale paralinguistic datasets from natural conversational speech. Using this framework, they build and release SynParaSpeech, an open dataset covering six paralinguistic categories with 118.75 hours of audio and precise timestamps. The corpus benefits both sides of spoken language processing: speech generation, through more natural paralinguistic synthesis, and speech understanding, through improved paralinguistic event detection.
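To make the dataset-construction idea concrete, here is a minimal, hypothetical sketch of the final step such a pipeline needs: turning per-frame paralinguistic predictions into timestamped events. The label set, frame length, and function names below are illustrative assumptions, not the paper's actual implementation or its six released categories.

```python
from dataclasses import dataclass
from itertools import groupby

# Hypothetical label set: the six categories are not enumerated in the abstract.
PARA_LABELS = {"laughter", "sigh", "breath", "cough", "throat-clear", "uhm"}

FRAME_S = 0.02  # 20 ms analysis frames (illustrative choice)

@dataclass(frozen=True)
class ParaEvent:
    label: str
    start_s: float  # event onset in seconds
    end_s: float    # event offset in seconds

def frames_to_events(frame_labels):
    """Collapse per-frame labels into timestamped paralinguistic events.

    `frame_labels` is a list of per-frame predictions ("speech", "silence",
    or a paralinguistic label). Contiguous runs of the same paralinguistic
    label become one event with frame-accurate start/end times.
    """
    events, t = [], 0
    for label, run in groupby(frame_labels):
        n = len(list(run))
        if label in PARA_LABELS:
            events.append(ParaEvent(label,
                                    round(t * FRAME_S, 3),
                                    round((t + n) * FRAME_S, 3)))
        t += n
    return events

# Example: 10 speech frames, 5 laughter frames, 5 speech frames.
frames = ["speech"] * 10 + ["laughter"] * 5 + ["speech"] * 5
print(frames_to_events(frames))  # one laughter event from 0.2 s to 0.3 s
```

A real pipeline would of course precede this with speech segmentation and a learned event classifier; this sketch only shows how timestamp precision follows from the frame resolution of the detector.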
📝 Abstract
Paralinguistic sounds, such as laughter and sighs, are crucial for synthesizing more realistic and engaging speech. However, existing methods typically depend on proprietary datasets, while publicly available resources often suffer from incomplete speech, inaccurate or missing timestamps, and limited real-world relevance. To address these problems, we propose an automated framework for generating large-scale paralinguistic data and apply it to construct the SynParaSpeech dataset, which comprises six paralinguistic categories totaling 118.75 hours of audio with precise timestamps, all derived from natural conversational speech. Our contributions are twofold: we introduce the first automated method for constructing large-scale paralinguistic datasets, and we release the SynParaSpeech corpus, which advances speech generation through more natural paralinguistic synthesis and enhances speech understanding by improving paralinguistic event detection. The dataset and audio samples are available at https://github.com/ShawnPi233/SynParaSpeech.