TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

📅 2025-09-22
🤖 AI Summary
Tibetan language resources are severely scarce, particularly high-quality parallel speech corpora for the Ü-Tsang, Amdo, and Kham dialects, which hinders the development of multi-dialect text-to-speech (TTS). This paper proposes the first unified multi-dialect TTS framework. Methodologically, it introduces a dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to explicitly model fine-grained cross-dialect acoustic and linguistic variations, and it integrates dialect label conditioning, shared multi-dialect representation learning, and dynamic feature routing. Evaluated on objective metrics (mel-cepstral distortion, MCD; F0 RMSE) and subjective mean opinion scores (MOS), the approach significantly outperforms both single-dialect and multi-task baselines. Furthermore, it generalizes to voice-conversion-based dialect translation, demonstrating high naturalness and practical applicability. The key contributions include: (1) the first end-to-end unified architecture for Tibetan multi-dialect TTS; (2) DSDR-Net for adaptive, dialect-aware feature routing; and (3) empirical validation across synthesis and conversion tasks.
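The two objective metrics named in the summary can be sketched as follows. This is a minimal illustration using the commonly used definitions of MCD and F0 RMSE, not the paper's evaluation code; the function names, the assumption that frames are already time-aligned, the choice to skip the 0th (energy) coefficient, and the convention that an F0 of 0 marks an unvoiced frame are all assumptions made here.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Assumption: coefficient 0 (energy) is excluded, as is common practice.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("frames must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Squared Euclidean distance over coefficients 1..D
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)


def f0_rmse(ref_f0, syn_f0):
    """RMSE in Hz over frames that are voiced in both contours.

    Assumption: an F0 value of 0 marks an unvoiced frame.
    """
    pairs = [(r, s) for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum((r - s) ** 2 for r, s in pairs) / len(pairs))
```

Lower is better for both metrics. In practice the reference and synthesized sequences usually differ in length and are aligned first (for example with dynamic time warping), a step this sketch leaves out.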

📝 Abstract
Tibetan is a low-resource language with limited parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose TMD-TTS, a unified Tibetan multi-dialect text-to-speech (TTS) framework that synthesizes parallel dialectal speech from explicit dialect labels. Our method features a dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects. Extensive objective and subjective evaluations demonstrate that TMD-TTS significantly outperforms baselines in dialectal expressiveness. We further validate the quality and utility of the synthesized speech through a challenging Speech-to-Speech Dialect Conversion (S2SDC) task.
Problem

Research questions and friction points this paper is trying to address.

Limited parallel speech corpora for Tibetan's three major dialects (Ü-Tsang, Amdo, and Kham)
Challenges in speech modeling for Tibetan as a low-resource language
Need for a unified multi-dialect text-to-speech synthesis framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multi-dialect TTS framework conditioned on explicit dialect labels
Dialect fusion module and DSDR-Net to capture fine-grained acoustic and linguistic variations across dialects
Synthesis of parallel dialectal speech for a low-resource language
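The idea of routing shared features by an explicit dialect label can be illustrated with a generic softmax-gated mixture of experts. This page does not describe the internals of DSDR-Net, so the sketch below is not the paper's architecture; `route`, `EXPERTS`, `GATE_LOGITS`, and the toy expert functions are hypothetical names and values chosen only for illustration.

```python
import math

DIALECTS = ("Ü-Tsang", "Amdo", "Kham")


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(features, dialect, gate_logits, experts):
    """Mix expert outputs using gate weights selected by the dialect label."""
    weights = softmax(gate_logits[dialect])
    outputs = [expert(features) for expert in experts]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(features))
    ]


# Hypothetical toy experts standing in for learned sub-networks.
EXPERTS = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]

# Hypothetical gate logits: each dialect mostly favors one expert.
GATE_LOGITS = {
    "Ü-Tsang": [4.0, 0.0, 0.0],
    "Amdo": [0.0, 4.0, 0.0],
    "Kham": [0.0, 0.0, 4.0],
}

shared = [1.0, 2.0]  # stand-in for shared text features
per_dialect = {d: route(shared, d, GATE_LOGITS, EXPERTS) for d in DIALECTS}
```

The same shared features produce a different output for each label, which is the property that lets one model emit parallel speech in several dialects. In a trained system the gate weights and experts would be learned rather than fixed by hand.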
Yutong Liu
School of Information and Software Engineering, University of Electronic Science and Technology of China, China
Ziyue Zhang
School of Information and Software Engineering, University of Electronic Science and Technology of China, China
Ban Ma-bao
School of Information and Software Engineering, University of Electronic Science and Technology of China, China
Renzeng Duojie
School of Information Science and Technology, Tibet University, China
Yuqing Cai
School of Information and Software Engineering, University of Electronic Science and Technology of China, China
Yongbin Yu
University of Electronic Science and Technology of China
Memristor, Neural Network, Natural Language Processing, Impulsive Control, Swarm Intelligence, EDA, MBSE
Xiangxiang Wang
University of Electronic Science and Technology of China
Neural networks, time scales, nonlinear systems, impulsive control
Fan Gao
Caltech; MIT
NGS Bioinformatics, Image data processing, AI/ML, Neurodegeneration, Protein Bioinformatics
Cheng Huang
Department of Ophthalmology, University of Texas Southwestern Medical Center, USA
Nyima Tashi
School of Information Science and Technology, Tibet University, China