📝 Abstract
What happens when an English Fairytaler is fine-tuned on Indian languages? We evaluate how the English F5-TTS model adapts to 11 Indian languages, measuring polyglot fluency, voice cloning, style cloning, and code-mixing. We compare: (i) training from scratch, (ii) fine-tuning English F5 on Indian data, and (iii) fine-tuning on both Indian and English data to prevent forgetting. Fine-tuning with only Indian data proves most effective, and the resulting IN-F5 is a near-human polyglot that enables speakers of one language (e.g., Odia) to speak another (e.g., Hindi) fluently. Our results show that English pretraining helps low-resource TTS reach human parity. To aid progress in other low-resource languages, we study data-constrained setups and arrive at a compute-optimal strategy. Finally, we show that IN-F5 can synthesize unseen languages such as Bhojpuri and Tulu using a human-in-the-loop approach for zero-resource TTS via synthetic data generation.