AI Summary
To address the challenges of speaker adaptation, low speech naturalness, and poor fidelity in few-shot text-to-speech (TTS) for low-resource Bengali, this paper proposes BnTTS, the first few-shot TTS framework designed specifically for Bengali. The authors adapt the multilingual TTS model XTTS to Bengali by incorporating language-specific phonological characteristics into the architecture and pretraining it on 3.85k hours of Bengali speech-text pairs. Leveraging multilingual transfer learning and speaker embedding fine-tuning, BnTTS supports both zero-shot and few-shot speaker adaptation. Experimental results show that BnTTS significantly outperforms current state-of-the-art Bengali TTS systems in naturalness (MOS), intelligibility, and speaker similarity. The work addresses a gap in high-quality, adaptive TTS for low-resource languages.
Abstract
This paper introduces BnTTS (Bangla Text-To-Speech), the first framework for speaker adaptation-based Bangla TTS, designed to bridge the gap in Bangla speech synthesis using minimal training data. Building upon the XTTS architecture, our approach integrates Bangla into a multilingual TTS pipeline, with modifications to account for the phonetic and linguistic characteristics of the language. We pre-train BnTTS on a 3.85k-hour Bangla speech dataset with corresponding text labels and evaluate its performance in both zero-shot and few-shot settings on our proposed test dataset. Empirical evaluations in few-shot settings show that BnTTS significantly improves the naturalness, intelligibility, and speaker fidelity of synthesized Bangla speech. Compared to state-of-the-art Bangla TTS systems, BnTTS exhibits superior performance on the Subjective Mean Opinion Score (SMOS), Naturalness, and Clarity metrics.