🤖 AI Summary
This work addresses the limited progress in Arabic multi-dialectal text-to-speech (TTS) synthesis, which has been hindered by the absence of unified modeling approaches, standardized datasets, and evaluation benchmarks. The authors propose the first open-source, end-to-end TTS system designed specifically for multi-dialectal Arabic, leveraging publicly available automatic speech recognition (ASR) corpora to construct a unified training framework. By integrating linguistically informed curriculum learning and in-context learning, the system achieves high-quality synthesis across both high- and low-resource dialects without requiring diacritized input text. The study releases the first open-source TTS models and evaluation benchmark for multi-dialectal Arabic, demonstrates synthesis quality surpassing that of a leading commercial service, and advances standardization and reproducibility in the field.
📝 Abstract
A notable gap persists in speech synthesis research and development for Arabic dialects, particularly from a unified modeling perspective. Despite its high practical value, the inherent linguistic complexity of Arabic dialects, further compounded by a lack of standardized data, benchmarks, and evaluation guidelines, steers researchers toward safer ground. To bridge this divide, we present Habibi, a suite of specialized and unified text-to-speech models that harnesses existing open-source ASR corpora to support a wide range of high- to low-resource Arabic dialects through linguistically informed curriculum learning. Our approach outperforms the leading commercial service in generation quality while remaining extensible through effective in-context learning, without requiring text diacritization. We are committed to open-sourcing the models and to creating the first systematic benchmark for multi-dialectal Arabic speech synthesis. Furthermore, by identifying the key challenges of the task and establishing evaluation standards for it, we aim to provide a solid groundwork for subsequent research. Resources at https://SWivid.github.io/Habibi/.