BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights

📅 2025-01-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses key challenges in Taiwanese Mandarin text-to-speech (TTS): polyphonic character ambiguity, code-switching between Mandarin and English, and poor generalization to long-tail speakers. We propose BreezyVoice, an end-to-end TTS framework specifically designed for Taiwanese Mandarin. Building on the CosyVoice architecture, we combine an S³ tokenizer, a large language model (LLM), and an optimal-transport conditional flow matching (OT-CFM) acoustic model with a grapheme-to-phoneme (G2P) prediction module. This enables context-aware polyphonic character disambiguation, fine-grained tone modeling, and natural prosodic contour generation. Experiments show superior performance on both general and code-switching test sets, with gains in naturalness, intelligibility, and polyphonic character accuracy, along with improved cross-speaker robustness and adaptability to mixed-language utterances. The results support the effectiveness and practicality of this multi-component architecture for regionalized TTS.
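As a rough illustration of the OT-CFM component named above, the following sketch computes the training pair used in the general optimal-transport conditional flow matching formulation from the flow-matching literature: a point on the straight-line path between a noise sample and a data sample, and the constant target velocity along that path. It is a minimal sketch under assumptions, not the paper's implementation. The function names `ot_cfm_pair` and `mse`, the default `sigma_min`, and the use of plain Python lists in place of mel-spectrogram tensors are all hypothetical choices made for illustration.

```python
def ot_cfm_pair(x0, x1, t, sigma_min=1e-4):
    """Return (x_t, u_t) for the OT conditional path between x0 and x1.

    x0: noise sample, x1: data sample (plain lists of floats here; a real
    system would use tensors). t is a time in [0, 1]. sigma_min is an
    assumed small constant, not a value taken from the paper.
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    # Point on the straight-line path from noise (t = 0) toward data (t = 1).
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    # Target velocity: constant in t for this path.
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def mse(predicted, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)
```

In training, a network would receive `x_t`, `t`, and conditioning inputs such as speech tokens and speaker information, and its output would be regressed onto `u_t` with a loss like `mse`. Which conditioning signals BreezyVoice uses is not specified in this summary.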

📝 Abstract

We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, highlighting phonetic control abilities to address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate an $S^{3}$ tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme-to-phoneme prediction model to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice's superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural codec TTS systems.
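To make the polyphone problem concrete, the sketch below shows how the reading of a character such as 行 or 樂 depends on the word it appears in, using Zhuyin (Bopomofo) readings. This is a toy dictionary lookup for illustration only. The abstract describes a learned grapheme-to-phoneme prediction model, which this sketch does not reproduce. The names `WORD_READINGS`, `DEFAULT_READINGS`, and `annotate` are hypothetical, and the lexicon holds only a few hand-picked entries.

```python
# Toy lexicon (illustrative entries only): word -> per-character Zhuyin readings.
WORD_READINGS = {
    "銀行": ["ㄧㄣˊ", "ㄏㄤˊ"],    # "bank": 行 is read háng
    "行走": ["ㄒㄧㄥˊ", "ㄗㄡˇ"],  # "to walk": 行 is read xíng
    "音樂": ["ㄧㄣ", "ㄩㄝˋ"],     # "music": 樂 is read yuè
    "快樂": ["ㄎㄨㄞˋ", "ㄌㄜˋ"],  # "happy": 樂 is read lè
}

# Assumed fallback readings for polyphonic characters seen without context.
DEFAULT_READINGS = {"行": "ㄒㄧㄥˊ", "樂": "ㄌㄜˋ"}


def annotate(text, max_word_len=2):
    """Greedy longest-match lookup.

    Returns a list of (character, reading) pairs; reading is None when the
    character is covered by neither the word lexicon nor the fallback table.
    """
    result = []
    i = 0
    while i < len(text):
        matched = False
        # Try the longest candidate word first, down to length 2.
        for n in range(min(max_word_len, len(text) - i), 1, -1):
            word = text[i:i + n]
            if word in WORD_READINGS:
                result.extend(zip(word, WORD_READINGS[word]))
                i += n
                matched = True
                break
        if not matched:
            ch = text[i]
            result.append((ch, DEFAULT_READINGS.get(ch)))
            i += 1
    return result


for sample in ("銀行", "行走", "音樂", "快樂"):
    print(sample, annotate(sample))
```

A lookup like this fails whenever the disambiguating context lies outside the matched word or the word is missing from the lexicon. That limitation is one reason to use a context-aware prediction model instead.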

Problem

Research questions and friction points this paper is trying to address.

Mandarin Speech Synthesis

Taiwanese Accent Adaptation

Multilingual Speaker Quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mandarin Speech Synthesis

Polyphony Handling

Advanced Model Integration
