🤖 AI Summary
This study addresses the limited performance of zero-shot text-to-speech (TTS) for Vietnamese, a low-resource language, by constructing and open-sourcing PhoAudiobook, the first large-scale, high-quality Vietnamese speech dataset (941 hours) intended for zero-shot TTS training and evaluation. Using PhoAudiobook, the study benchmarks three leading zero-shot TTS models (VALL-E, VoiceCraft, and XTTS-V2) on Vietnamese synthesis. The experiments show that VALL-E and VoiceCraft perform best when synthesizing short sentences, which the authors interpret as robustness across diverse linguistic contexts. In addition, training on PhoAudiobook consistently improves all three models on metrics including naturalness, speaker similarity, and intelligibility. The work establishes the first high-fidelity zero-shot TTS benchmark for Vietnamese and provides an openly released dataset and an empirical foundation for zero-shot TTS research in low-resource languages.
📝 Abstract
This paper introduces PhoAudiobook, a newly curated dataset comprising 941 hours of high-quality audio for Vietnamese text-to-speech. Using PhoAudiobook, we conduct experiments on three leading zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2. Our findings demonstrate that PhoAudiobook consistently enhances model performance across various metrics. Moreover, VALL-E and VoiceCraft exhibit superior performance in synthesizing short sentences, highlighting their robustness in handling diverse linguistic contexts. We publicly release PhoAudiobook to facilitate further research and development in Vietnamese text-to-speech.