Phir Hera Fairy: An English Fairytaler is a Strong Faker of Fluent Speech in Low-Resource Indian Languages

📅 2025-05-27
🤖 AI Summary
This study investigates how well the English F5-TTS model adapts to text-to-speech (TTS) in 11 low-resource Indian languages, evaluating multilingual fluency, cross-lingual voice and style cloning, and code-mixed speech synthesis. We compare training from scratch, fine-tuning on Indian-language data only, and joint fine-tuning on Indian and English data, and find that fine-tuning on Indian data alone is the most effective. We also study data-constrained setups to arrive at a compute-optimal training strategy, and use human-in-the-loop synthetic data construction for languages with no training data. The resulting IN-F5 model reaches near-human performance across multiple objective and subjective metrics, synthesizes previously unseen languages (e.g., Bhojpuri), and supports cross-lingual voice cloning (e.g., an Odia speaker uttering Hindi). We open-source a comprehensive evaluation benchmark, establishing a reproducible and lightweight adaptation framework for low-resource multilingual TTS.

📝 Abstract
What happens when an English Fairytaler is fine-tuned on Indian languages? We evaluate how the English F5-TTS model adapts to 11 Indian languages, measuring polyglot fluency, voice cloning, style cloning, and code-mixing. We compare: (i) training from scratch, (ii) fine-tuning English F5 on Indian data, and (iii) fine-tuning on both Indian and English data to prevent forgetting. Fine-tuning with only Indian data proves most effective, and the resultant IN-F5 is a near-human polyglot that enables speakers of one language (e.g., Odia) to speak fluently in another (e.g., Hindi). Our results show that English pretraining helps low-resource TTS reach human parity. To aid progress in other low-resource languages, we study data-constrained setups and arrive at a compute-optimal strategy. Finally, we show that IN-F5 can synthesize unseen languages such as Bhojpuri and Tulu, using a human-in-the-loop approach to zero-resource TTS via synthetic data generation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating English F5-TTS adaptation to 11 Indian languages
Comparing training methods for low-resource TTS performance
Synthesizing unseen languages via human-in-the-loop zero-resource TTS
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tune English F5 on Indian data
Human-in-the-loop for zero-resource TTS
Compute-optimal training strategy for data-constrained, low-resource settings
Authors

Praveen Srinivasa Varadhan
AI4Bharat, Indian Institute of Technology Madras, India
Srija Anand
MS by Research, AI4Bharat, IIT Madras
Speech Synthesis, Natural Language Processing, LLM Evaluation
Soma Siddhartha
Saryps Labs, India
Mitesh M. Khapra
AI4Bharat, Indian Institute of Technology Madras, India