AI Summary
To address the longstanding scarcity of high-quality, open-source multi-speaker datasets for Modern Standard Arabic (MSA) text-to-speech (TTS), this work introduces ArVoice, the first publicly available multi-speaker MSA TTS dataset with diacritized transcriptions. ArVoice combines a new professionally recorded set from six voice talents, a modified subset of the Arabic Speech Corpus, and high-quality synthetic speech from two commercial systems, for a total of 11 voices and 83.52 hours of audio, of which around 10 hours are human speech from 7 speakers. The corpus targets multi-speaker TTS and can also support speech-based diacritic restoration, voice conversion (VC), and deepfake detection. Three open-source TTS systems and two VC systems trained on the data illustrate these use cases. The corpus is available for research use.
Abstract
We introduce ArVoice, a multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions. It is intended for multi-speaker speech synthesis and can also be useful for other tasks such as speech-based diacritic restoration, voice conversion, and deepfake detection. ArVoice comprises: (1) a new professionally recorded set from six voice talents with diverse demographics; (2) a modified subset of the Arabic Speech Corpus; and (3) high-quality synthetic speech from two commercial systems. The complete corpus totals 83.52 hours of speech across 11 voices, of which around 10 hours are human speech from 7 speakers. We train three open-source TTS systems and two voice conversion systems to illustrate the use cases of the dataset. The corpus is available for research use.