🤖 AI Summary
This work addresses limited fine-grained control over accent, emotion, and speaking style in multilingual text-to-speech (TTS) systems for Indian languages and English. To this end, we introduce RASMALAI, a large-scale, multi-attribute speech dataset spanning 23 Indian languages and English, comprising 13,000 hours of speech and 24 million fine-grained textual attribute annotations. We also release IndicParlerTTS, the first open-source, text-description-driven TTS system tailored for Indian languages. Our approach employs text-description conditioning, multi-task attribute disentanglement, and cross-lingual representation sharing to enable robust cross-lingual and cross-speaker transfer of emotion, accent, and style. Extensive evaluation shows strong performance on named-speaker synthesis fidelity, description adherence, attribute accuracy, and cross-lingual expressive transfer, establishing a new benchmark for controllable multilingual TTS in Indian languages.
📝 Abstract
We introduce RASMALAI, a large-scale speech dataset with rich text descriptions, designed to advance controllable and expressive text-to-speech (TTS) synthesis for 23 Indian languages and English. It comprises 13,000 hours of speech and 24 million text-description annotations covering fine-grained attributes such as speaker identity, accent, emotion, style, and background conditions. Using RASMALAI, we develop IndicParlerTTS, the first open-source, text-description-guided TTS system for Indian languages. Systematic evaluation demonstrates its ability to generate high-quality speech for named speakers, reliably follow text descriptions, and accurately synthesize specified attributes. It also transfers expressive characteristics effectively both within and across languages. IndicParlerTTS consistently achieves strong performance across these evaluations, setting a new standard for controllable, expressive multilingual speech synthesis in Indian languages.