🤖 AI Summary
To address the scarcity of paired text–speech data for low-resource language TTS, this paper proposes ASR-guided Group Relative Policy Optimization (GRPO), an online reinforcement learning framework that optimizes TTS models without requiring paired data. The method integrates multilingual pretraining, IPA-based phonemic modeling, and a multi-objective reward function—leveraging a pretrained multilingual ASR model to provide language-agnostic intelligibility feedback, while incorporating speaker verification and speech quality assessment for holistic reward estimation. Experiments demonstrate significant improvements in intelligibility and speaker consistency on low-resource languages over standard fine-tuning, as well as superior performance over offline alignment via DPO on high-resource languages. The framework proves effective and robust across both resource-scarce and resource-rich settings, establishing a new paradigm for unsupervised and weakly supervised TTS.
📝 Abstract
Developing high-quality text-to-speech (TTS) systems for low-resource languages is challenging due to the scarcity of paired text and speech data. In contrast, automatic speech recognition (ASR) models for such languages are often more accessible, owing to large-scale multilingual pre-training efforts. We propose a framework based on Group Relative Policy Optimization (GRPO) to adapt an autoregressive, multilingual TTS model to new languages. Our method first establishes a language-agnostic foundation for TTS synthesis by training a multilingual baseline with International Phonetic Alphabet (IPA) tokens. Next, we fine-tune this model on limited paired data from the target languages to capture their prosodic features. Finally, we apply GRPO to optimize the model using only unpaired text and speaker prompts, guided by a multi-objective reward from pretrained ASR, speaker verification, and audio quality estimation models. Experiments demonstrate that this pipeline produces intelligible and speaker-consistent speech in low-resource languages, substantially outperforming fine-tuning alone. Furthermore, our GRPO-based framework also improves TTS performance in high-resource languages, surpassing offline alignment methods such as Direct Preference Optimization (DPO) and yielding superior intelligibility, speaker similarity, and audio quality.
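The abstract's core mechanism — scoring groups of sampled utterances with a multi-objective reward and normalizing within each group — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reward weights (`w_asr`, `w_spk`, `w_qual`) and the scorer outputs (character error rate from ASR, speaker-embedding similarity, a scaled quality estimate) are assumptions made for the example.

```python
# Sketch of the group-relative advantage computation at the heart of GRPO,
# applied to a multi-objective TTS reward. Weights and scores are illustrative.

def combined_reward(cer, spk_sim, qual, w_asr=1.0, w_spk=0.5, w_qual=0.5):
    """Multi-objective reward: ASR intelligibility (lower CER is better),
    speaker similarity from a verification model, and estimated audio quality."""
    return w_asr * (1.0 - cer) + w_spk * spk_sim + w_qual * qual

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO normalizes each candidate's reward against the group of G
    rollouts sampled for the same (text, speaker prompt) input, so no
    separate value network is needed."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: G = 4 synthesized candidates for one unpaired text + speaker prompt.
cers      = [0.05, 0.30, 0.10, 0.50]  # from a pretrained multilingual ASR
spk_sims  = [0.85, 0.80, 0.90, 0.60]  # from a speaker-verification model
quals     = [0.90, 0.80, 0.85, 0.70]  # from a speech-quality estimator (scaled)

rewards = [combined_reward(c, s, q) for c, s, q in zip(cers, spk_sims, quals)]
advantages = group_relative_advantages(rewards)
# Candidates scoring above the group mean receive positive advantages and are
# reinforced; no paired ground-truth speech is ever consulted.
```

The policy-gradient update itself (clipped likelihood ratios over the TTS model's token sequence) is omitted here; the sketch shows only how the three pretrained judges are fused into a single per-sample learning signal.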