🤖 AI Summary
Tibetan language resources are severely scarce, particularly high-quality parallel speech corpora for the Ü-Tsang, Amdo, and Kham dialects—hindering the development of multi-dialect text-to-speech (TTS). This paper proposes TMD-TTS, the first unified Tibetan multi-dialect TTS framework. Methodologically, it introduces a dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to explicitly model fine-grained cross-dialect acoustic and linguistic variations, combining dialect label conditioning, shared multi-dialect representation learning, and dynamic feature routing. Evaluated with objective metrics (MCD, F0 RMSE) and subjective MOS scores, the approach significantly outperforms both single-dialect and multi-task baselines. It further generalizes to a Speech-to-Speech Dialect Conversion (S2SDC) task, demonstrating high naturalness and practical applicability. The key contributions are: (1) the first end-to-end unified architecture for Tibetan multi-dialect TTS; (2) DSDR-Net for adaptive, dialect-aware feature routing; and (3) empirical validation across both synthesis and conversion tasks.
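To make the routing idea concrete, here is a minimal sketch of dialect-conditioned dynamic routing, assuming DSDR-Net behaves like mixture-of-experts-style routing in which a gate conditioned on the dialect label blends per-dialect expert branches over a shared representation. All names, shapes, and the gating form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sketch: a shared encoder feature vector is routed
# through per-dialect "expert" transforms, blended by a softmax
# gate conditioned on a one-hot dialect label. Shapes and names
# are illustrative only.

DIALECTS = ["u-tsang", "amdo", "kham"]
rng = np.random.default_rng(0)
D = 8  # feature dimension (illustrative)

# One small linear expert per dialect (dialect-specific weights).
experts = {d: rng.standard_normal((D, D)) / np.sqrt(D) for d in DIALECTS}
# Gate parameters: map a one-hot dialect label to routing logits.
gate_W = rng.standard_normal((len(DIALECTS), len(DIALECTS)))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(features, dialect):
    """Blend expert outputs using gate weights for the given dialect."""
    onehot = np.eye(len(DIALECTS))[DIALECTS.index(dialect)]
    weights = softmax(gate_W @ onehot)  # soft routing weights, sum to 1
    outputs = [features @ experts[d] for d in DIALECTS]
    return sum(w * o for w, o in zip(weights, outputs))

x = rng.standard_normal(D)
y = route(x, "amdo")
print(y.shape)  # (8,)
```

Because the gate depends only on the dialect label, the same text-side representation yields dialect-specific acoustic features at synthesis time, which is the behavior the summary attributes to dynamic feature routing.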
📝 Abstract
Tibetan is a low-resource language with limited parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), which hinders progress in speech modeling. To address this issue, we propose TMD-TTS, a unified Tibetan multi-dialect text-to-speech (TTS) framework that synthesizes parallel dialectal speech from explicit dialect labels. Our method features a dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects. Extensive objective and subjective evaluations demonstrate that TMD-TTS significantly outperforms baselines in dialectal expressiveness. We further validate the quality and utility of the synthesized speech on a challenging Speech-to-Speech Dialect Conversion (S2SDC) task.