Scaling Rich Style-Prompted Text-to-Speech Datasets

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large-scale speech datasets cover only basic acoustic style tags (e.g., "loud", "slow"), while rich abstract tags (e.g., "guttural", "pained") have been limited to small human-annotated sets due to prohibitive annotation costs. To address this, the authors introduce ParaSpeechCaps, a large-scale multi-label speech dataset covering 59 style tags, spanning both speaker-level intrinsic tags and utterance-level situational tags, and comprising 342 hours of human-annotated and 2,427 hours of automatically annotated speech (2,769 hours total). A pipeline combining off-the-shelf text and speech embedders, classifiers, and an audio-language model (ALM) scales rich tag annotation automatically for the first time. Fine-tuning Parler-TTS on ParaSpeechCaps improves style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best-performing baseline. The dataset, models, and code are fully open-sourced.
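The summary describes combining several automatic annotators (embedding similarity, classifiers, and an ALM) into multi-label tags. A minimal sketch of one way such evidence could be aggregated is shown below; the function name, score sources, and thresholds are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical multi-source tag aggregation: keep a style tag when at
# least `min_agree` of three annotators endorse it. The annotators here
# (embedding similarity scores, classifier probabilities, ALM yes/no
# votes) stand in for the paper's components; all names and thresholds
# are illustrative.

def aggregate_tags(embed_scores, clf_probs, alm_votes,
                   embed_thresh=0.6, clf_thresh=0.5, min_agree=2):
    """Return the sorted tags endorsed by at least `min_agree` sources."""
    tags = set(embed_scores) | set(clf_probs) | set(alm_votes)
    kept = []
    for tag in tags:
        votes = 0
        if embed_scores.get(tag, 0.0) >= embed_thresh:
            votes += 1  # text/speech embedding similarity agrees
        if clf_probs.get(tag, 0.0) >= clf_thresh:
            votes += 1  # classifier agrees
        if alm_votes.get(tag, False):
            votes += 1  # audio-language model agrees
        if votes >= min_agree:
            kept.append(tag)
    return sorted(kept)

labels = aggregate_tags(
    embed_scores={"guttural": 0.72, "nasal": 0.41},
    clf_probs={"guttural": 0.81, "pained": 0.55},
    alm_votes={"guttural": True, "nasal": True},
)
print(labels)  # ['guttural']
```

Requiring agreement between independent annotators is one plausible way to keep automatic annotation precise enough to scale without human review.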

📝 Abstract
We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models and code are released at https://github.com/ajd12342/paraspeechcaps.
Problem

Research questions and friction points this paper is trying to address.

How to scale rich style annotations for text-to-speech datasets beyond small human-annotated sets.
How to automate rich tag annotation using off-the-shelf text and speech embedders, classifiers, and audio language models.
How to improve the style consistency and speech quality of style-prompted TTS models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

ParaSpeechCaps, a 2,769-hour multi-label dataset covering 59 rich style tags, both speaker-level intrinsic and utterance-level situational.
An automatic annotation pipeline combining text and speech embedders, classifiers, and an audio language model to scale rich tag annotation for the first time.
A fine-tuned Parler-TTS model with improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best baseline.