🤖 AI Summary
Tibetan language resources are severely scarce, particularly high-quality parallel speech corpora for the Ü-Tsang, Amdo, and Kham dialects—hindering the development of multi-dialect text-to-speech (TTS). This paper proposes TMD-TTS, the first unified Tibetan multi-dialect TTS framework. Methodologically, it introduces a dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to explicitly model fine-grained cross-dialect acoustic and linguistic variations, combining dialect label conditioning, shared multi-dialect representation learning, and dynamic feature routing. Evaluated with objective metrics (MCD, F0 RMSE) and subjective MOS scores, the approach significantly outperforms both single-dialect and multi-task baselines. It further generalizes to a Speech-to-Speech Dialect Conversion (S2SDC) task, demonstrating high naturalness and practical applicability. The key contributions are: (1) the first end-to-end unified architecture for Tibetan multi-dialect TTS; (2) DSDR-Net for adaptive, dialect-aware feature routing; and (3) empirical validation across both synthesis and conversion tasks.
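To make the routing idea concrete, here is a minimal sketch of dialect-conditioned dynamic routing, assuming DSDR-Net behaves like mixture-of-experts-style routing in which a gate conditioned on the dialect label blends per-dialect expert branches over a shared representation. All names, shapes, and the gating form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sketch: a shared encoder feature vector is routed
# through per-dialect "expert" transforms, blended by a softmax
# gate conditioned on a one-hot dialect label. Shapes and names
# are illustrative only.

DIALECTS = ["u-tsang", "amdo", "kham"]
rng = np.random.default_rng(0)
D = 8  # feature dimension (illustrative)

# One small linear expert per dialect (dialect-specific weights).
experts = {d: rng.standard_normal((D, D)) / np.sqrt(D) for d in DIALECTS}
# Gate parameters: map a one-hot dialect label to routing logits.
gate_W = rng.standard_normal((len(DIALECTS), len(DIALECTS)))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(features, dialect):
    """Blend expert outputs using gate weights for the given dialect."""
    onehot = np.eye(len(DIALECTS))[DIALECTS.index(dialect)]
    weights = softmax(gate_W @ onehot)  # soft routing weights, sum to 1
    outputs = [features @ experts[d] for d in DIALECTS]
    return sum(w * o for w, o in zip(weights, outputs))

x = rng.standard_normal(D)
y = route(x, "amdo")
print(y.shape)  # (8,)
```

Because the gate depends only on the dialect label, the same text-side representation yields dialect-specific acoustic features at synthesis time, which is the behavior the summary attributes to dynamic feature routing.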
📝 Abstract
Tibetan is a low-resource language with limited parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), which hinders progress in speech modeling. To address this issue, we propose TMD-TTS, a unified Tibetan multi-dialect text-to-speech (TTS) framework that synthesizes parallel dialectal speech from explicit dialect labels. Our method features a dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects. Extensive objective and subjective evaluations demonstrate that TMD-TTS significantly outperforms baselines in dialectal expressiveness. We further validate the quality and utility of the synthesized speech on a challenging Speech-to-Speech Dialect Conversion (S2SDC) task.