ARCADE: A City-Scale Corpus for Fine-Grained Arabic Dialect Tagging

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of fine-grained, city-level annotation in existing Arabic multi-dialect datasets, a gap that has limited research on high-precision dialect identification. The authors introduce ARCADE, which they describe as the first Arabic speech corpus built with city-level dialect granularity. It comprises 3,790 radio audio segments from 58 cities across 19 countries, with 6,907 annotations from native speakers covering emotion, speech type, dialect category, and a validity flag for dialect identification. The pipeline collects audio from verified radio streams, extracts 30-second segments, and has each clip reviewed by one to three native speakers; the data includes both Modern Standard Arabic (MSA) and dialectal speech. The dataset is publicly available on Hugging Face as a benchmark for city-level dialect tagging and multi-task learning in Arabic speech processing.
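The 30-second segment extraction mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `segment_bounds`, the 16 kHz sample rate, and the choice to drop a trailing partial segment are all assumptions made for illustration.

```python
from typing import List, Tuple


def segment_bounds(
    num_samples: int,
    sample_rate: int,
    segment_seconds: float = 30.0,
    keep_partial: bool = False,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for consecutive fixed-length segments.

    Hypothetical helper: the paper summary only states that clips are
    30 seconds long, not how the cut points are chosen.
    """
    if sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("sample_rate and segment_seconds must be positive")

    seg_len = int(round(sample_rate * segment_seconds))
    bounds = []
    for start in range(0, num_samples, seg_len):
        end = min(start + seg_len, num_samples)
        # Keep full-length segments; keep a shorter tail only on request.
        if end - start == seg_len or keep_partial:
            bounds.append((start, end))
    return bounds


# Example: 95 seconds of audio at an assumed 16 kHz sample rate.
full_only = segment_bounds(95 * 16000, 16000)
with_tail = segment_bounds(95 * 16000, 16000, keep_partial=True)
```

Here `full_only` holds three 30-second windows and the 5-second remainder is discarded, while `with_tail` keeps it as a fourth, shorter window.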

📝 Abstract
The Arabic language is characterized by a rich tapestry of regional dialects that differ substantially in phonetics and lexicon, reflecting the geographic and cultural diversity of its speakers. Despite the availability of many multi-dialect datasets, mapping speech to fine-grained dialect sources, such as cities, remains underexplored. We present ARCADE (Arabic Radio Corpus for Audio Dialect Evaluation), the first Arabic speech dataset designed explicitly with city-level dialect granularity. The corpus comprises Arabic radio speech collected from streaming services across the Arab world. Our data pipeline captures 30-second segments from verified radio streams, encompassing both Modern Standard Arabic (MSA) and diverse dialectal speech. To ensure reliability, each clip was annotated by one to three native Arabic reviewers who assigned rich metadata, including emotion, speech type, dialect category, and a validity flag for dialect identification tasks. The resulting corpus comprises 6,907 annotations and 3,790 unique audio segments spanning 58 cities across 19 countries. These fine-grained annotations enable robust multi-task learning, serving as a benchmark for city-level dialect tagging. We detail the data collection methodology, assess audio quality, and provide a comprehensive analysis of label distributions. The dataset is available at: https://huggingface.co/datasets/riotu-lab/ARCADE-full
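Because each clip carries one to three annotations (6,907 annotations over 3,790 segments), anyone using the corpus for single-label training needs some way to reduce them to one label per clip. The abstract does not say how, or whether, the authors do this. The sketch below shows one common option, majority voting with ties left unresolved. The `(clip_id, label)` tuple layout and the function name `aggregate_labels` are hypothetical and do not reflect the dataset's actual schema.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def aggregate_labels(
    annotations: Iterable[Tuple[str, str]],
) -> Dict[str, Optional[str]]:
    """Reduce per-annotator labels to one label per clip by majority vote.

    Assumed input: (clip_id, label) pairs, one per annotation.
    A clip whose top labels are tied maps to None so it can be
    reviewed or excluded rather than resolved arbitrarily.
    """
    by_clip = defaultdict(list)
    for clip_id, label in annotations:
        by_clip[clip_id].append(label)

    result: Dict[str, Optional[str]] = {}
    for clip_id, labels in by_clip.items():
        ranked = Counter(labels).most_common()
        top_label, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        result[clip_id] = None if tied else top_label
    return result


# Illustrative annotations only; these are not rows from ARCADE.
example = [
    ("clip_a", "Riyadh"), ("clip_a", "Riyadh"), ("clip_a", "Jeddah"),
    ("clip_b", "Cairo"),
    ("clip_c", "Tunis"), ("clip_c", "Sfax"),
]
merged = aggregate_labels(example)
```

In this example `clip_a` resolves to the label two of three reviewers chose, `clip_b` keeps its single annotation, and `clip_c` is left as `None` because its two reviewers disagree.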
Problem

Research questions and friction points this paper is trying to address.

Arabic dialect
city-level tagging
speech corpus
dialect identification
fine-grained annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained dialect tagging
city-level Arabic dialect
speech corpus
multi-task learning
Arabic radio speech
Authors
Omer Nacar
Tuwaiq Academy, Riyadh 13415, Saudi Arabia
Serry Sibaee
Research Engineer
Arabic Natural Language Processing, NLP
A. Ammar
Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia
Yasser Alhabashi
Unknown affiliation
N. Sibai
Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia
Y. Ahmed
Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia
Ahmed Saud Alqusaiyer
Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia
Sulieman Mahmoud AlMahmoud
Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia
Abdulrhman Mamdoh Mukhaniq
Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia
Lubaba R. Raed
Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia
Sulaiman Mohammed Alatwah
Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia
Waad Nasser Alqahtani
Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia
Yousif Alnasser
Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia
Mohamed Aziz Khadraoui
Higher School of Communication of Tunis (SUP’COM), Ariana 2083, Tunisia
Wadii Boulila
Professor of Computer Science, Leader of Robotics & Internet of Things Lab, Prince Sultan University
Data Science, Machine Learning, Uncertainty Modeling, Remote Sensing, Computer Vision