Fanar 2.0: Arabic Generative AI Stack

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of Arabic web data (merely 0.5% of global content) and limited computational resources by adopting a training paradigm that prioritizes high-quality data over sheer volume. Building on Gemma-3-27B, the project performs continual pre-training on 256 H100 GPUs and employs model merging to develop Fanar-27B, a sovereign, controllable large language model tailored for Arabic. The system integrates a full-stack generative AI architecture featuring speech (Aura), vision (Oryx), multi-agent frameworks, cultural alignment mechanisms, and bilingual safety filtering (FanarGuard). Fanar-27B demonstrates significant gains over Fanar 1.0 in Arabic knowledge (+9.1), language proficiency (+7.3), dialect handling (+3.5), and English capability (+7.6), while uniquely enabling high-quality poetry generation, Islamic content processing, and bilingual translation, achieving international competitiveness despite its constrained scale.
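The summary mentions model merging as one of the levers behind Fanar-27B. The paper does not spell out its merge recipe here, but a common baseline is linear interpolation of checkpoint parameters; the following is a minimal sketch under that assumption, using plain Python dicts in place of real tensors (`merge_checkpoints` and its inputs are illustrative, not the authors' code):

```python
def merge_checkpoints(checkpoints, weights=None):
    """Linearly merge model checkpoints in parameter space.

    checkpoints: list of dicts mapping parameter name -> list of floats
                 (a stand-in for real weight tensors).
    weights: optional per-checkpoint mixing coefficients summing to 1;
             defaults to a uniform average.
    """
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        params = [ckpt[name] for ckpt in checkpoints]
        # Element-wise weighted sum across checkpoints.
        merged[name] = [
            sum(w * p[i] for w, p in zip(weights, params))
            for i in range(len(params[0]))
        ]
    return merged

# Two toy "checkpoints" with one shared parameter tensor each.
a = {"layer.weight": [1.0, 2.0]}
b = {"layer.weight": [3.0, 4.0]}
print(merge_checkpoints([a, b]))  # uniform merge -> {'layer.weight': [2.0, 3.0]}
```

In practice the same idea is applied to full `state_dict`s of continually pre-trained variants, with non-uniform weights (or more elaborate schemes such as TIES) chosen to balance, e.g., Arabic gains against English retention.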

📝 Abstract
We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on just 256 NVIDIA H100 GPUs, and Arabic accounts for only ~0.5% of web data despite having 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. The Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.
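The abstract describes an orchestrator that combines intent-aware routing with defense-in-depth safety validation across the specialist components. The sketch below illustrates that control flow only; the component names come from the abstract, but the keyword rules, the `moderate` stand-in for FanarGuard, and all function signatures are hypothetical, not the paper's implementation:

```python
def moderate(text):
    # Stand-in for the FanarGuard bilingual safety filter (illustrative:
    # a real filter is a trained 4B moderation model, not a word list).
    blocked = {"harmful"}
    return not any(word in text.lower() for word in blocked)

def classify_intent(text):
    # Toy keyword router; a production system would use a trained
    # intent classifier rather than substring matching.
    rules = [
        ("translate", "FanarShaheen"),  # bilingual translation
        ("poem", "Fanar-Diwan"),        # classical Arabic poetry
        ("fatwa", "Fanar-Sadiq"),       # Islamic content, multi-agent
        ("image", "Oryx"),              # vision family
        ("audio", "Aura"),              # speech family
    ]
    for keyword, component in rules:
        if keyword in text.lower():
            return component
    return "Fanar-27B"  # default: the core LLM

def route(text):
    # Safety validation runs before any routing (defense-in-depth:
    # a real orchestrator would also re-check the generated output).
    if not moderate(text):
        return ("blocked", None)
    return ("ok", classify_intent(text))

print(route("Please translate this sentence"))  # ('ok', 'FanarShaheen')
print(route("Write a poem about the sea"))      # ('ok', 'Fanar-Diwan')
```

The layering matters: moderation wraps routing, so no specialist component ever sees an unfiltered request, and a default fallback to the core LLM keeps the router total over all inputs.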
Problem

Research questions and friction points this paper is trying to address.

Arabic Generative AI
resource-constrained AI
sovereign AI
low-resource language
multimodal AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

resource-constrained AI
Arabic-centric LLM
continual pre-training
model merging
sovereign AI stack
Authors
FANAR TEAM
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Ummar Abbas
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Mohammad Shahmeer Ahmad
Research Engineer, Qatar Computing Research Institute
Information Retrieval, Data-Centric AI, AI Systems, LLMs
Minhaj Ahmad
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Abdulaziz Al-Homaid
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Anas Al-Nuaimi
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Enes Altinisik
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Ehsaneddin Asgari
Scientist at QCRI; UC Berkeley PhD alum.; previously at Helmholtz Center, MIT-CSAIL, MIT-BCS, LMU, EPFL, SUT
Natural Language Processing, Bioinformatics, Deep Learning, Digital Humanities, Machine Learning
Sanjay Chawla
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Shammur Chowdhury
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Fahim Dalvi
Qatar Computing Research Institute
Deep Learning, Machine Translation, Artificial Intelligence, Explainable AI
Kareem Darwish
QCRI
Information Retrieval, Natural Language Processing, Arabic NLP
Nadir Durrani
Senior Scientist, QCRI, HBKU
Machine Translation, Interpretability, Transliteration, Word Segmentation, Natural Language Processing
Mohamed Elfeky
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Ahmed Elmagarmid
Executive Director, Qatar Computing Research Institute
Database Systems
Mohamed Eltabakh
Principal Scientist at QCRI, Qatar
AI, Database Systems, Data Science, Big Data Management
Asim Ersoy
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Masoomali Fatehkia
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Mohammed Qusay Hashim
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Majd Hawasly
QCRI, Hamad Bin Khalifa University
Autonomous Systems, Lifelong Learning, Natural Language Processing
Mohamed Hefeeda
Simon Fraser University
Multimedia Systems, Computer Networks, Multimedia AI
Mus'ab Husaini
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Keivin Isufaj
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Soon-Gyo Jung
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
Houssam Lachemat
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University