🤖 AI Summary
This study addresses key challenges in Taiwanese Mandarin text-to-speech (TTS): polyphonic character ambiguity, code-switching between Mandarin and English, and poor generalization to long-tail speakers. We propose the first end-to-end TTS framework specifically designed for Taiwanese Mandarin. Methodologically, we build on the CosyVoice architecture, combining an S³ tokenizer, a large language model (LLM), and an optimal-transport conditional flow matching (OT-CFM) acoustic model, and augment it with a grapheme-to-phoneme (G2P) module. This enables context-aware polyphonic character disambiguation, fine-grained tone modeling, and natural prosodic contour generation. Experiments demonstrate state-of-the-art performance on both general and code-switching test sets, with superior naturalness, intelligibility, and polyphonic character accuracy, along with significantly improved cross-speaker robustness and adaptability to mixed-language utterances. The results validate the effectiveness and practicality of our multi-component architecture for regionalized TTS.
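For orientation, the OT-CFM component can be summarized by the standard flow matching objective with an optimal-transport probability path. The summary itself gives no formula, so the following is a sketch in the commonly used notation, not the authors' exact training loss. The symbols are assumptions: $x_0$ is Gaussian noise, $x_1$ is a target acoustic feature (for example a Mel spectrogram), $c$ stands for whatever conditioning the model receives, $\sigma_{\min}$ is a small constant, and $v_\theta$ is the learned vector field.

$$x_t = \bigl(1 - (1 - \sigma_{\min})\,t\bigr)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]$$

$$\mathcal{L}_{\text{OT-CFM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1} \left\| v_\theta(x_t, t \mid c) - \bigl(x_1 - (1 - \sigma_{\min})\,x_0\bigr) \right\|^2$$

The regression target $x_1 - (1 - \sigma_{\min})\,x_0$ is the time derivative of $x_t$, so the network learns the velocity that carries a noise sample along a straight line toward the data sample.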
📝 Abstract
We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, with phonetic control abilities that address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate an $S^{3}$ tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme-to-phoneme prediction model to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice's superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural codec TTS systems.
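To make the polyphone problem concrete, here is a minimal sketch of context-dependent reading selection with an optional manual override. It is a rule-based toy, not BreezyVoice's grapheme-to-phoneme model: the mini-lexicon, the function name `to_phonemes`, and the inline annotation syntax `字[:reading]` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-lexicon (illustration only). Word-level entries capture
# context: the character 行 is read differently in 銀行 and 行走.
WORD_READINGS = {
    "銀行": ["ㄧㄣˊ", "ㄏㄤˊ"],
    "行走": ["ㄒㄧㄥˊ", "ㄗㄡˇ"],
}

# Fallback per-character readings, used when no word entry matches.
CHAR_DEFAULTS = {
    "我": "ㄨㄛˇ",
    "去": "ㄑㄩˋ",
    "銀": "ㄧㄣˊ",
    "行": "ㄒㄧㄥˊ",
    "走": "ㄗㄡˇ",
}

# Hypothetical inline override: one character followed by [:reading].
OVERRIDE = re.compile(r"(.)\[:(.+?)\]")


def to_phonemes(text):
    """Map text to a list of Bopomofo readings (toy, assumption-laden)."""
    readings = []
    i = 0
    while i < len(text):
        # 1. An explicit annotation wins over any lexicon entry.
        match = OVERRIDE.match(text, i)
        if match:
            readings.append(match.group(2))
            i = match.end()
            continue
        # 2. Two-character word lookup supplies context-aware readings.
        word = text[i:i + 2]
        if word in WORD_READINGS:
            readings.extend(WORD_READINGS[word])
            i += 2
            continue
        # 3. Otherwise use the default reading; unknown characters pass through.
        readings.append(CHAR_DEFAULTS.get(text[i], text[i]))
        i += 1
    return readings


if __name__ == "__main__":
    print(to_phonemes("我去銀行"))
    print(to_phonemes("行走"))
    print(to_phonemes("行[:ㄏㄤˊ]"))
```

A learned G2P model replaces the lexicon lookup with a prediction conditioned on the whole sentence, but the override path shows what phonetic control means in practice: the user can pin a reading when the automatic choice is wrong.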