🤖 AI Summary
This work addresses speech synthesis of the Peruvian Constitution in Quechua, a low-resource Indigenous language, and in Spanish, a high-resource language. The authors build a unified text-to-speech (TTS) pipeline around three architectures: XTTS v2, F5-TTS, and DiFlow-TTS. The models are trained on independent Spanish and Quechua speech datasets that differ in size and recording conditions, and they rely on cross-lingual transfer to mitigate data scarcity in Quechua while preserving naturalness in Spanish. The project releases trained checkpoints, inference code, and synthesized audio for each constitutional article, offering a reusable resource for Indigenous-language and multilingual legal speech technologies.
📝 Abstract
We present a unified pipeline for synthesizing high-quality Quechua and Spanish speech for the Peruvian Constitution using three state-of-the-art text-to-speech (TTS) architectures: XTTS v2, F5-TTS, and DiFlow-TTS. Our models are trained on independent Spanish and Quechua speech datasets of differing sizes and recording conditions, and leverage bilingual and multilingual TTS capabilities to improve synthesis quality in both languages. By exploiting cross-lingual transfer, our framework mitigates data scarcity in Quechua while preserving naturalness in Spanish. We release trained checkpoints, inference code, and synthesized audio for each constitutional article, providing a reusable resource for speech technologies in Indigenous and multilingual contexts. This work contributes to the development of inclusive TTS systems for political and legal content in low-resource settings.