MURAD: A Large-Scale Multi-Domain Unified Reverse Arabic Dictionary Dataset

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Arabic lacks large-scale, multi-domain datasets that align words with their definitions, which hinders progress in natural language processing and lexicography. To address this gap, this work presents MURAD, an open lexical dataset for Arabic comprising 96,243 word-definition pairs drawn from trusted reference works and educational sources spanning linguistics, Islamic studies, mathematics, physics, psychology, and engineering. A hybrid extraction pipeline that combines direct text parsing, optical character recognition (OCR), and automated reconstruction is used to keep the entries accurate and consistently formatted, and each record carries metadata identifying its source domain. MURAD supports downstream tasks such as reverse dictionary modeling and semantic retrieval, and is intended as a foundational resource for Arabic computational linguistics, educational tool development, and reproducible research.

📝 Abstract
Arabic is a linguistically and culturally rich language with a vast vocabulary that spans scientific, religious, and literary domains. Yet, large-scale lexical datasets linking Arabic words to precise definitions remain limited. We present MURAD (Multi-domain Unified Reverse Arabic Dictionary), an open lexical dataset with 96,243 word-definition pairs. The data come from trusted reference works and educational sources. Extraction used a hybrid pipeline integrating direct text parsing, optical character recognition, and automated reconstruction. This ensures accuracy and clarity. Each record aligns a target word with its standardized Arabic definition and metadata that identifies the source domain. The dataset covers terms from linguistics, Islamic studies, mathematics, physics, psychology, and engineering. It supports computational linguistics and lexicographic research. Applications include reverse dictionary modeling, semantic retrieval, and educational tools. By releasing this resource, we aim to advance Arabic natural language processing and promote reproducible research on Arabic lexical semantics.
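The record layout and the reverse dictionary task described in the abstract can be illustrated with a minimal sketch. The field names (`word`, `definition`, `domain`), the `reverse_lookup` helper, and the toy English records below are all assumptions made for illustration; they are not taken from the released dataset, and the word-overlap scoring is a simple stand-in rather than the authors' method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    # Hypothetical field names; the released dataset may label them differently.
    word: str        # target word
    definition: str  # standardized definition text
    domain: str      # source-domain metadata


def tokens(text: str) -> set:
    # Naive whitespace tokenization; real Arabic text would likely need
    # normalization (diacritics, letter variants) before comparison.
    return set(text.lower().split())


def reverse_lookup(query: str, entries: List[Entry],
                   domain: Optional[str] = None, k: int = 3) -> List[str]:
    """Rank words by Jaccard overlap between the query and each definition."""
    q = tokens(query)
    scored = []
    for e in entries:
        if domain is not None and e.domain != domain:
            continue  # optional filter on the source-domain metadata
        d = tokens(e.definition)
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        scored.append((score, e.word))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for score, word in scored[:k] if score > 0]


# Invented placeholder records, not actual MURAD entries.
TOY_ENTRIES = [
    Entry("velocity", "rate of change of position with respect to time", "physics"),
    Entry("derivative", "rate of change of a function with respect to its variable", "mathematics"),
    Entry("morpheme", "smallest meaningful unit of a language", "linguistics"),
]

if __name__ == "__main__":
    print(reverse_lookup("smallest meaningful unit", TOY_ENTRIES))
    print(reverse_lookup("rate of change", TOY_ENTRIES, domain="physics"))
```

A reverse dictionary maps a description back to the word it defines, so the lookup runs from definition text to target word. A practical system would probably replace the overlap score with learned sentence embeddings, but the record shape and the optional domain filter would stay the same.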
Problem

Research questions and friction points this paper is trying to address.

Arabic lexical resources
reverse dictionary
multi-domain dataset
lexical semantics
natural language processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Arabic lexical semantics
reverse dictionary
multi-domain dataset
hybrid data extraction
optical character recognition
Authors

Serry Sibaee
Research Engineer
Arabic Natural Language Processing, NLP
Yasser AlHabashi
Robotics and Internet-of-Things Laboratory (RIOTU), Prince Sultan University, Riyadh 11586, Saudi Arabia
Nadia Sibai
Robotics and Internet-of-Things Laboratory (RIOTU), Prince Sultan University, Riyadh 11586, Saudi Arabia
Yara Farouk
Robotics and Internet-of-Things Laboratory (RIOTU), Prince Sultan University, Riyadh 11586, Saudi Arabia
A. Ammar
Robotics and Internet-of-Things Laboratory (RIOTU), Prince Sultan University, Riyadh 11586, Saudi Arabia
Sawsan Alhalawani
Robotics and Internet-of-Things Laboratory (RIOTU), Prince Sultan University, Riyadh 11586, Saudi Arabia
Wadii Boulila
Professor of Computer Science, Leader of Robotics & Internet of Things Lab, Prince Sultan University
Data Science, Machine Learning, Uncertainty Modeling, Remote Sensing, Computer Vision