Zero-Shot Text-to-Speech for Vietnamese

📅 2025-06-02
🤖 AI Summary
This study addresses the limited performance of zero-shot text-to-speech (TTS) for Vietnamese, a low-resource language, by constructing and publicly releasing PhoAudiobook, a 941-hour dataset of high-quality Vietnamese speech. Using PhoAudiobook, the authors benchmark three leading zero-shot TTS models (VALL-E, VoiceCraft, and XTTS-V2) on Vietnamese synthesis. The experiments show that PhoAudiobook consistently improves all three models across metrics including naturalness, speaker similarity, and intelligibility, and that VALL-E and VoiceCraft perform better than XTTS-V2 when synthesizing short sentences. The released dataset gives researchers an open resource for further work on zero-shot TTS in Vietnamese.

📝 Abstract
This paper introduces PhoAudiobook, a newly curated dataset comprising 941 hours of high-quality audio for Vietnamese text-to-speech. Using PhoAudiobook, we conduct experiments on three leading zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2. Our findings demonstrate that PhoAudiobook consistently enhances model performance across various metrics. Moreover, VALL-E and VoiceCraft exhibit superior performance in synthesizing short sentences, highlighting their robustness in handling diverse linguistic contexts. We publicly release PhoAudiobook to facilitate further research and development in Vietnamese text-to-speech.
Problem

Research questions and friction points this paper is trying to address.

Introducing the PhoAudiobook dataset for Vietnamese TTS
Evaluating zero-shot TTS models on Vietnamese speech synthesis
Improving model performance with the PhoAudiobook dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the PhoAudiobook dataset (941 hours) for Vietnamese TTS
Evaluates VALL-E, VoiceCraft, and XTTS-V2 models
PhoAudiobook consistently enhances model performance across various metrics