CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech

📅 2025-06-03
🤖 AI Summary
Current CapTTS research is hindered by the lack of large-scale, application-oriented datasets and by ambiguous definitions of downstream tasks, which limits practical deployment. Method: We introduce CapSpeech, the first large-scale, downstream-application-focused benchmark for CapTTS, covering four stylistic TTS tasks: sound events, accents, emotions, and conversational agents. It comprises over 10 million machine-annotated and roughly 360k human-annotated audio-caption pairs. We formally define a taxonomy of CapTTS downstream tasks, propose two new tasks (CapTTS-SE and AgentTTS) accompanied by professionally recorded data, and benchmark both autoregressive and non-autoregressive models on the full suite. Contribution/Results: Experiments demonstrate high-fidelity, highly intelligible synthesis across a diverse range of speaking styles with fine-grained style controllability, establishing a unified evaluation standard and strong baselines for CapTTS research and development.
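
The benchmark's core unit is an audio-caption pair: a transcript to be spoken plus a free-form natural-language caption describing the target style (and, for CapTTS-SE, an embedded sound event). Below is a minimal sketch of what such a pair could look like; the field names and example values are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical record layout for a style-captioned TTS pair.
# Field names are illustrative assumptions, not CapSpeech's actual schema.
from dataclasses import dataclass

@dataclass
class CapSpeechExample:
    transcript: str     # the text to be spoken
    style_caption: str  # free-form description of the speaking style
    audio_path: str     # path to the target waveform
    task: str           # "CapTTS-SE", "AccCapTTS", "EmoCapTTS", or "AgentTTS"

# CapTTS-SE: the caption also references a sound event inside the utterance.
se_example = CapSpeechExample(
    transcript="Mind the gap between the train and the platform.",
    style_caption="A calm female announcer; a door chime rings mid-sentence.",
    audio_path="clips/announcement_0001.wav",
    task="CapTTS-SE",
)

# AgentTTS: the caption describes a conversational-agent persona.
agent_example = CapSpeechExample(
    transcript="Sure, I've moved your meeting to 3 p.m.",
    style_caption="A friendly, upbeat virtual assistant with a neutral accent.",
    audio_path="clips/agent_0042.wav",
    task="AgentTTS",
)
```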

📝 Abstract
Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchmark designed for a series of CapTTS-related tasks, including style-captioned text-to-speech synthesis with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and text-to-speech synthesis for chat agents (AgentTTS). CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. In addition, we introduce two new datasets collected and recorded by a professional voice actor and experienced audio engineers, specifically for the AgentTTS and CapTTS-SE tasks. Alongside the datasets, we conduct comprehensive experiments using both autoregressive and non-autoregressive models on CapSpeech. Our results demonstrate high-fidelity and highly intelligible speech synthesis across a diverse range of speaking styles. To the best of our knowledge, CapSpeech is the largest available dataset offering comprehensive annotations for CapTTS-related tasks. The experiments and findings further provide valuable insights into the challenges of developing CapTTS systems.
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized datasets for style-captioned TTS tasks.
Limited research on downstream applications of CapTTS technology.
Need for high-quality annotated data for diverse speaking styles.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the CapSpeech benchmark spanning four CapTTS tasks: CapTTS-SE, AccCapTTS, EmoCapTTS, and AgentTTS
Provides over 10M machine-annotated and 0.36M human-annotated audio-caption pairs, plus professionally recorded data for the two new tasks
Benchmarks both autoregressive and non-autoregressive models (see the loading sketch after this list)
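
If the corpus is published in a standard format, a training pipeline could consume it with the Hugging Face `datasets` library. The sketch below assumes a hypothetical dataset ID and column names; it is not the official release layout.

```python
# Hedged loading/filtering sketch; the dataset ID and column names are
# assumptions, not the official CapSpeech release layout.
from datasets import load_dataset

ds = load_dataset("OpenSound/CapSpeech", split="train")  # hypothetical ID

# Keep only emotion-captioned pairs, assuming a `task` column exists.
emo = ds.filter(lambda ex: ex["task"] == "EmoCapTTS")

# Inspect a few caption/transcript pairs.
for ex in emo.select(range(3)):
    print(ex["style_caption"], "->", ex["transcript"])
```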