🤖 AI Summary
To address the scarcity of paired text–speech data for low-resource language TTS, this paper proposes ASR-guided Group Relative Policy Optimization (GRPO), an online reinforcement learning framework that optimizes TTS models without requiring paired data. The method integrates multilingual pretraining, IPA-based phonemic modeling, and a multi-objective reward function—leveraging a pretrained multilingual ASR model to provide language-agnostic intelligibility feedback, while incorporating speaker verification and speech quality assessment for holistic reward estimation. Experiments demonstrate significant improvements in intelligibility and speaker consistency on low-resource languages over standard fine-tuning, as well as superior performance over offline alignment via DPO on high-resource languages. The framework proves effective and robust across both resource-scarce and resource-rich settings, establishing a new paradigm for unsupervised and weakly supervised TTS.
📝 Abstract
Developing high-quality text-to-speech (TTS) systems for low-resource languages is challenging due to the scarcity of paired text and speech data. In contrast, automatic speech recognition (ASR) models for such languages are often more accessible, owing to large-scale multilingual pre-training efforts. We propose a framework based on Group Relative Policy Optimization (GRPO) to adapt an autoregressive, multilingual TTS model to new languages. Our method first establishes a language-agnostic foundation for TTS synthesis by training a multilingual baseline with International Phonetic Alphabet (IPA) tokens. Next, we fine-tune this model on limited paired data from the target languages to capture their prosodic features. Finally, we apply GRPO to optimize the model using only unpaired text and speaker prompts, guided by a multi-objective reward from pretrained ASR, speaker verification, and audio quality estimation models. Experiments demonstrate that this pipeline produces intelligible and speaker-consistent speech in low-resource languages, substantially outperforming fine-tuning alone. Furthermore, our GRPO-based framework also improves TTS performance in high-resource languages, surpassing offline alignment methods such as Direct Preference Optimization (DPO) and yielding superior intelligibility, speaker similarity, and audio quality.
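The abstract's core mechanism — scoring groups of sampled utterances with a multi-objective reward and normalizing within each group — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reward weights (`w_asr`, `w_spk`, `w_qual`) and the scorer outputs (character error rate from ASR, speaker-embedding similarity, a scaled quality estimate) are assumptions made for the example.

```python
# Sketch of the group-relative advantage computation at the heart of GRPO,
# applied to a multi-objective TTS reward. Weights and scores are illustrative.

def combined_reward(cer, spk_sim, qual, w_asr=1.0, w_spk=0.5, w_qual=0.5):
    """Multi-objective reward: ASR intelligibility (lower CER is better),
    speaker similarity from a verification model, and estimated audio quality."""
    return w_asr * (1.0 - cer) + w_spk * spk_sim + w_qual * qual

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO normalizes each candidate's reward against the group of G
    rollouts sampled for the same (text, speaker prompt) input, so no
    separate value network is needed."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: G = 4 synthesized candidates for one unpaired text + speaker prompt.
cers      = [0.05, 0.30, 0.10, 0.50]  # from a pretrained multilingual ASR
spk_sims  = [0.85, 0.80, 0.90, 0.60]  # from a speaker-verification model
quals     = [0.90, 0.80, 0.85, 0.70]  # from a speech-quality estimator (scaled)

rewards = [combined_reward(c, s, q) for c, s, q in zip(cers, spk_sims, quals)]
advantages = group_relative_advantages(rewards)
# Candidates scoring above the group mean receive positive advantages and are
# reinforced; no paired ground-truth speech is ever consulted.
```

The policy-gradient update itself (clipped likelihood ratios over the TTS model's token sequence) is omitted here; the sketch shows only how the three pretrained judges are fused into a single per-sample learning signal.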