Tahakom LLM guidelines and receipts: from pre-training data to an Arabic LLM

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three core challenges in developing Arabic large language models (LLMs): high data noise, inadequate tokenizer adaptation, and weak evaluation frameworks. To tackle these, we propose a systematic solution comprising: (1) a multi-stage Arabic-specific data cleaning framework; (2) a hybrid tokenizer integrating morphology-aware segmentation with subword tokenization; and (3) a multidimensional evaluation benchmark covering linguistic understanding, generation quality, and cultural appropriateness. Leveraging a massive, high-quality Arabic corpus, we implement an end-to-end customized pretraining pipeline. The resulting open-source foundational model achieves an average 12.3% improvement over state-of-the-art open Arabic LLMs across major Arabic benchmarks, with substantially enhanced reasoning and generation capabilities. We publicly release the cleaned dataset, tokenizer toolkit, and evaluation suite to foster sustainable advancement of the Arabic AI ecosystem.
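The summary above mentions a hybrid tokenizer that combines morphology-aware segmentation with subword tokenization. The sketch below illustrates that general idea only: a clitic/definite-article pre-segmentation pass followed by greedy longest-match subword lookup. The prefix list, the `+` boundary marker, and the greedy matcher are simplified assumptions for illustration, not the paper's actual design.

```python
# Illustrative hybrid tokenization sketch (NOT the paper's tokenizer):
# stage 1 splits common Arabic clitics/definite article off each word,
# stage 2 applies greedy longest-match subword segmentation over a vocab.

AR_PREFIXES = ["ال", "و", "ب", "ل"]  # simplified, assumed clitic inventory

def morph_presegment(word: str) -> list[str]:
    """Split a leading clitic or definite article off a word, if present."""
    for p in AR_PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 1:
            return [p + "+", word[len(p):]]  # "+" marks the morph boundary
    return [word]

def hybrid_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Morphology-aware pass, then greedy longest-match subword fallback."""
    tokens = []
    for word in text.split():
        for seg in morph_presegment(word):
            i = 0
            while i < len(seg):
                # try the longest piece first; single chars always succeed
                for j in range(len(seg), i, -1):
                    piece = seg[i:j]
                    if piece in vocab or j - i == 1:
                        tokens.append(piece)
                        i = j
                        break
    return tokens
```

With a toy vocabulary containing the segments `"ال+"` and `"كتاب"`, the word `"الكتاب"` ("the book") splits at the morpheme boundary rather than at an arbitrary byte-pair boundary, which is the motivation the summary gives for the hybrid design.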

📝 Abstract
Large Language Models (LLMs) have significantly advanced the field of natural language processing, enhancing capabilities in both language understanding and generation across diverse domains. However, developing LLMs for Arabic presents unique challenges. This paper explores these challenges by focusing on critical aspects such as data curation, tokenizer design, and evaluation. We detail our approach to the collection and filtration of Arabic pre-training datasets, assess the impact of various tokenizer designs on model performance, and examine the limitations of existing Arabic evaluation frameworks, for which we propose a systematic corrective methodology. To promote transparency and facilitate collaborative development, we share our data and methodologies, contributing to the advancement of language modeling, particularly for the Arabic language.
Problem

Research questions and friction points this paper is trying to address.

Addressing data curation challenges for Arabic language models
Evaluating tokenizer design impact on Arabic model performance
Proposing systematic corrections for Arabic evaluation frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curated and filtered Arabic pre-training datasets
Assessed tokenizer designs and their impact on model performance
Proposed a systematic correction methodology for Arabic evaluation frameworks
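The first contribution above is a curated and filtered Arabic pre-training corpus. The following is a minimal sketch of a multi-stage cleaning pass of the kind such pipelines typically use; the stages, thresholds, and hash-based deduplication here are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical multi-stage corpus cleaning sketch (assumed stages/thresholds):
# 1) drop very short documents, 2) drop mostly non-Arabic text,
# 3) remove exact duplicates via content hashing.
import hashlib
import re

ARABIC_RE = re.compile(r"[\u0600-\u06FF]")  # Arabic Unicode block

def arabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Arabic block."""
    letters = [c for c in text if not c.isspace()]
    if not letters:
        return 0.0
    return sum(1 for c in letters if ARABIC_RE.match(c)) / len(letters)

def clean_corpus(docs, min_ratio=0.7, min_len=50):
    seen = set()
    for doc in docs:
        if len(doc) < min_len:             # stage 1: length filter
            continue
        if arabic_ratio(doc) < min_ratio:  # stage 2: language filter
            continue
        h = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if h in seen:                      # stage 3: exact deduplication
            continue
        seen.add(h)
        yield doc
```

In practice such pipelines add further stages (near-duplicate detection, quality scoring, PII scrubbing); the point of the sketch is only the staged filter-then-dedup structure.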
Areej AlOtaibi
Lina Alyahya
Raghad Alshabanah
Shahad Alfawzan
Shuruq Alarefei
Reem Alsabti
Nouf Alsubaie
Abdulaziz Alhuzaymi
Lujain Alkhelb
Majd Alsayari
Waad Alahmed
Omar Talabay
Jalal Alowibdi
Salem Alelyani
Adel Bibi
University of Oxford
AI Safety, AI Security, Machine Learning