Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited progress in Arabic multi-dialectal text-to-speech (TTS) synthesis, which has been hindered by the absence of unified modeling approaches, standardized datasets, and evaluation benchmarks. The authors propose the first open-source, end-to-end TTS system designed specifically for multi-dialectal Arabic, leveraging publicly available automatic speech recognition (ASR) corpora to construct a unified training framework. By integrating linguistically informed curriculum learning and in-context learning mechanisms, the system enables high-quality synthesis across both high- and low-resource dialects without relying on diacritized input text. The study establishes the first open-source TTS models and evaluation benchmark for multi-dialectal Arabic, demonstrates synthesis quality surpassing the leading commercial service, and advances standardization and reproducibility in the field.

📝 Abstract
A notable gap persists in speech synthesis research and development for Arabic dialects, particularly from a unified modeling perspective. Despite its high practical value, the inherent linguistic complexity of Arabic dialects, further compounded by a lack of standardized data, benchmarks, and evaluation guidelines, steers researchers toward safer ground. To bridge this divide, we present Habibi, a suite of specialized and unified text-to-speech models that harnesses existing open-source ASR corpora to support a wide range of high- to low-resource Arabic dialects through linguistically-informed curriculum learning. Our approach outperforms the leading commercial service in generation quality, while maintaining extensibility through effective in-context learning, without requiring text diacritization. We are committed to open-sourcing the model, along with creating the first systematic benchmark for multi-dialect Arabic speech synthesis. Furthermore, by identifying the key challenges in and establishing evaluation standards for the process, we aim to provide a solid groundwork for subsequent research. Resources at https://SWivid.github.io/Habibi/ .
Problem

Research questions and friction points this paper is trying to address.

Arabic dialects
speech synthesis
unified modeling
benchmark
data standardization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Arabic TTS
Curriculum Learning
In-Context Learning
Dialectal Speech Synthesis
Open-Source Benchmark
👥 Authors

Yushen Chen
Shanghai Jiao Tong University
Speech and Language Processing

Junzhe Liu
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, Shanghai Jiao Tong University

Yujie Tu
Shanghai Innovation Institute; University of Chinese Academy of Sciences

Zhikang Niu
Shanghai Jiao Tong University
Speech Synthesis

Yuzhe Liang
Shanghai Jiao Tong University
Deep Learning, Multimodal Learning

Kai Yu
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, Shanghai Jiao Tong University

Chunyu Qiang
Kuaishou Technology; TJU; CASIA
Speech Synthesis

Chen Zhang
Kuaishou Technology

Xie Chen
Shanghai Jiao Tong University (previously Microsoft and Cambridge University)
Machine Learning, Speech Recognition, Speech Synthesis, Speech & Audio Processing