Zero-Shot Text-to-Speech for Vietnamese

📅 2025-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the limited performance of zero-shot text-to-speech (TTS) for Vietnamese, a low-resource language, by building and publicly releasing PhoAudiobook, a 941-hour dataset of high-quality Vietnamese audio. Using PhoAudiobook, the authors run experiments on three leading zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2. PhoAudiobook consistently improves model performance across metrics including naturalness, speaker similarity, and intelligibility. VALL-E and VoiceCraft also perform better when synthesizing short sentences, which the authors read as a sign of robustness. The dataset is released to support further research and development in Vietnamese TTS.

📝 Abstract

This paper introduces PhoAudiobook, a newly curated dataset comprising 941 hours of high-quality audio for Vietnamese text-to-speech. Using PhoAudiobook, we conduct experiments on three leading zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2. Our findings demonstrate that PhoAudiobook consistently enhances model performance across various metrics. Moreover, VALL-E and VoiceCraft exhibit superior performance in synthesizing short sentences, highlighting their robustness in handling diverse linguistic contexts. We publicly release PhoAudiobook to facilitate further research and development in Vietnamese text-to-speech.

Problem

Research questions and friction points this paper is trying to address.

Introducing the PhoAudiobook dataset for Vietnamese TTS

Evaluating zero-shot TTS models on Vietnamese audio

Enhancing model performance with the PhoAudiobook dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the PhoAudiobook dataset (941 hours) for Vietnamese TTS

Evaluates three zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2

PhoAudiobook consistently enhances model performance across various metrics
