Fanar 2.0: Arabic Generative AI Stack

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of Arabic web data (merely 0.5% of global content) and limited computational resources by adopting a training paradigm that prioritizes high-quality data over sheer volume. Building on Gemma-3-27B, the project performs continual pre-training on 256 H100 GPUs and employs model merging to develop Fanar-27B, a sovereign, controllable large language model tailored for Arabic. The system integrates a full-stack generative AI architecture featuring speech (Aura), vision (Oryx), multi-agent frameworks, cultural alignment mechanisms, and bilingual safety filtering (FanarGuard). Fanar-27B demonstrates significant gains over Fanar 1.0 in Arabic knowledge (+9.1), language proficiency (+7.3), dialect handling (+3.5), and English capability (+7.6), while uniquely enabling high-quality poetry generation, Islamic content processing, and bilingual translation, achieving international competitiveness despite its constrained scale.
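The summary mentions model merging as one of the levers behind Fanar-27B. The paper does not spell out its merge recipe here, but a common baseline is linear interpolation of checkpoint parameters; the following is a minimal sketch under that assumption, using plain Python dicts in place of real tensors (`merge_checkpoints` and its inputs are illustrative, not the authors' code):

```python
def merge_checkpoints(checkpoints, weights=None):
    """Linearly merge model checkpoints in parameter space.

    checkpoints: list of dicts mapping parameter name -> list of floats
                 (a stand-in for real weight tensors).
    weights: optional per-checkpoint mixing coefficients summing to 1;
             defaults to a uniform average.
    """
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        params = [ckpt[name] for ckpt in checkpoints]
        # Element-wise weighted sum across checkpoints.
        merged[name] = [
            sum(w * p[i] for w, p in zip(weights, params))
            for i in range(len(params[0]))
        ]
    return merged

# Two toy "checkpoints" with one shared parameter tensor each.
a = {"layer.weight": [1.0, 2.0]}
b = {"layer.weight": [3.0, 4.0]}
print(merge_checkpoints([a, b]))  # uniform merge -> {'layer.weight': [2.0, 3.0]}
```

In practice the same idea is applied to full `state_dict`s of continually pre-trained variants, with non-uniform weights (or more elaborate schemes such as TIES) chosen to balance, e.g., Arabic gains against English retention.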

📝 Abstract
We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on just 256 NVIDIA H100 GPUs, and Arabic accounts for only ~0.5% of web data despite having 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. The Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.
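The abstract describes an orchestrator that combines intent-aware routing with defense-in-depth safety validation across the specialist components. The sketch below illustrates that control flow only; the component names come from the abstract, but the keyword rules, the `moderate` stand-in for FanarGuard, and all function signatures are hypothetical, not the paper's implementation:

```python
def moderate(text):
    # Stand-in for the FanarGuard bilingual safety filter (illustrative:
    # a real filter is a trained 4B moderation model, not a word list).
    blocked = {"harmful"}
    return not any(word in text.lower() for word in blocked)

def classify_intent(text):
    # Toy keyword router; a production system would use a trained
    # intent classifier rather than substring matching.
    rules = [
        ("translate", "FanarShaheen"),  # bilingual translation
        ("poem", "Fanar-Diwan"),        # classical Arabic poetry
        ("fatwa", "Fanar-Sadiq"),       # Islamic content, multi-agent
        ("image", "Oryx"),              # vision family
        ("audio", "Aura"),              # speech family
    ]
    for keyword, component in rules:
        if keyword in text.lower():
            return component
    return "Fanar-27B"  # default: the core LLM

def route(text):
    # Safety validation runs before any routing (defense-in-depth:
    # a real orchestrator would also re-check the generated output).
    if not moderate(text):
        return ("blocked", None)
    return ("ok", classify_intent(text))

print(route("Please translate this sentence"))  # ('ok', 'FanarShaheen')
print(route("Write a poem about the sea"))      # ('ok', 'Fanar-Diwan')
```

The layering matters: moderation wraps routing, so no specialist component ever sees an unfiltered request, and a default fallback to the core LLM keeps the router total over all inputs.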
Problem

Research questions and friction points this paper is trying to address.

Arabic Generative AI
resource-constrained AI
sovereign AI
low-resource language
multimodal AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

resource-constrained AI
Arabic-centric LLM
continual pre-training
model merging
sovereign AI stack
Authors
FANAR TEAM
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Ummar Abbas
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Mohammad Shahmeer Ahmad
Research Engineer, Qatar Computing Research Institute
Information Retrieval, Data-Centric AI, AI Systems, LLMs
Minhaj Ahmad
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Abdulaziz Al-Homaid
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Anas Al-Nuaimi
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Enes Altinisik
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Ehsaneddin Asgari
Scientist at QCRI; UC Berkeley PhD alum.; previously at Helmholtz Center, MIT-CSAIL, MIT-BCS, LMU, EPFL, SUT
Natural Language Processing, Bioinformatics, Deep Learning, Digital Humanities, Machine Learning
Sanjay Chawla
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Shammur Chowdhury
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Fahim Dalvi
Qatar Computing Research Institute
Deep Learning, Machine Translation, Artificial Intelligence, Explainable AI
Kareem Darwish
QCRI
Information Retrieval, Natural Language Processing, Arabic NLP
Nadir Durrani
Senior Scientist, QCRI, HBKU
Machine Translation, Interpretability, Transliteration, Word Segmentation, Natural Language Processing
Mohamed Elfeky
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Ahmed Elmagarmid
Executive Director, Qatar Computing Research Institute
Database Systems
Mohamed Eltabakh
Principal Scientist at QCRI, Qatar
AI, Database Systems, Data Science, Big Data Management
Asim Ersoy
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Masoomali Fatehkia
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Mohammed Qusay Hashim
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Majd Hawasly
QCRI, Hamad Bin Khalifa University
Autonomous Systems, Lifelong Learning, Natural Language Processing
Mohamed Hefeeda
Simon Fraser University
Multimedia Systems, Computer Networks, Multimedia AI
Mus'ab Husaini
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Keivin Isufaj
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Soon-Gyo Jung
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Houssam Lachemat
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University