🤖 AI Summary
Background: Current CapTTS research is hindered by the absence of large-scale, application-oriented datasets and by loosely defined downstream tasks, which limits practical deployment.
Method: We introduce CapSpeech, the first large-scale, downstream-application-focused benchmark for CapTTS, covering four stylistic TTS tasks: style-captioned TTS with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and TTS for chat agents (AgentTTS). It comprises over 10 million machine-annotated and nearly 0.36 million human-annotated audio-caption pairs. We define this family of CapTTS downstream tasks, introduce two new tasks, AgentTTS and CapTTS-SE, each backed by data recorded by a professional voice actor and experienced audio engineers, and benchmark both autoregressive and non-autoregressive models on the full suite.
Contribution/Results: Experiments show that training on CapSpeech yields high-fidelity, highly intelligible synthesis across a diverse range of speaking styles and improves fine-grained style controllability. To our knowledge, CapSpeech is the largest comprehensively annotated dataset for CapTTS-related tasks, and it establishes a unified evaluation setting and strong baselines for CapTTS research and development.
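To make the data layout concrete, the sketch below shows what a single CapSpeech audio-caption pair could look like in code. The field names and example values are illustrative assumptions made for this summary, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class CaptionedUtterance:
    """One audio-caption pair. Field names are assumptions, not the real schema."""
    audio_path: str     # path to the recorded or synthesized waveform
    transcript: str     # the text to be spoken
    style_caption: str  # free-form style description (accent, emotion, sound events, ...)
    task: str           # "CapTTS-SE" | "AccCapTTS" | "EmoCapTTS" | "AgentTTS"
    annotation: str     # "machine" (10M-scale) or "human" (~0.36M refined pairs)

example = CaptionedUtterance(
    audio_path="clips/000001.wav",
    transcript="The quick brown fox jumps over the lazy dog.",
    style_caption=("A cheerful young female voice with a light British accent, "
                   "speaking briskly while a door slams in the background."),
    task="CapTTS-SE",
    annotation="human",
)
```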
📝 Abstract
Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchmark designed for a series of CapTTS-related tasks, including style-captioned text-to-speech synthesis with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and text-to-speech synthesis for chat agents (AgentTTS). CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. In addition, we introduce two new datasets collected and recorded by a professional voice actor and experienced audio engineers, specifically for the AgentTTS and CapTTS-SE tasks. Alongside the datasets, we conduct comprehensive experiments using both autoregressive and non-autoregressive models on CapSpeech. Our results demonstrate high-fidelity and highly intelligible speech synthesis across a diverse range of speaking styles. To the best of our knowledge, CapSpeech is the largest available dataset offering comprehensive annotations for CapTTS-related tasks. The experiments and findings further provide valuable insights into the challenges of developing CapTTS systems.
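As a rough illustration of the CapTTS setting these experiments target, the toy PyTorch sketch below conditions a phoneme encoder on an embedded style caption. This is not the paper's architecture (the paper benchmarks existing autoregressive and non-autoregressive models); every module name and dimension here is a hypothetical, minimal stand-in for the generic pattern of mapping (text, style caption) to acoustic features.

```python
import torch
import torch.nn as nn

class CaptionConditionedTTS(nn.Module):
    """Toy sketch of caption-conditioned acoustic modeling (NOT the paper's model)."""

    def __init__(self, n_phonemes=100, n_caption_tokens=5000, d_model=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.caption_emb = nn.Embedding(n_caption_tokens, d_model)
        self.caption_enc = nn.GRU(d_model, d_model, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids, caption_ids):
        # Summarize the free-form style caption into a single style vector.
        _, h = self.caption_enc(self.caption_emb(caption_ids))  # h: (1, B, d)
        style = h[-1].unsqueeze(1)                               # (B, 1, d)
        # Broadcast-add the style vector to every phoneme position,
        # then predict mel frames (one per phoneme, for simplicity).
        x = self.phoneme_emb(phoneme_ids) + style
        return self.to_mel(self.encoder(x))                      # (B, T, n_mels)

mels = CaptionConditionedTTS()(torch.randint(0, 100, (2, 12)),
                               torch.randint(0, 5000, (2, 20)))
print(mels.shape)  # torch.Size([2, 12, 80])
```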