QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

📅 2025-06-02
🤖 AI Summary
Arabic OCR faces inherent challenges, including cursive script, diacritical marks (tashkeel), and variability in fonts and layouts. To address these, we introduce QARI, a series of open-source, Arabic-specific multimodal OCR models built upon Qwen2-VL-2B-Instruct. Our method is an iterative fine-tuning paradigm driven by synthetic data, integrating vision-language joint alignment, Arabic-customized tokenization, and rule-aware post-processing. QARI v0.2 sets a new open-source state of the art on diacritically rich texts, with WER = 0.160, CER = 0.061, and BLEU = 0.737, and shows improved tashkeel recognition and robustness to low-resolution inputs. A further exploratory model, QARI v0.3, shows strong potential for document structure understanding and handwritten Arabic text recognition. All models and datasets are released to foster reproducibility and further research.

📝 Abstract
The inherent complexities of Arabic script, namely its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state of the art with a Word Error Rate (WER) of 0.160, a Character Error Rate (CER) of 0.061, and a BLEU score of 0.737 on diacritically rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside strong performance on low-resolution images. Further explorations (QARI v0.3) show strong potential for structural document understanding and handwritten text recognition. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
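
For readers unfamiliar with the two error metrics quoted in the abstract, the following is a minimal sketch of how WER and CER are conventionally computed: the Levenshtein edit distance between reference and OCR output, divided by the reference length, counted over words or over characters. The function names are illustrative, and the paper's actual evaluation script and any text normalization it applies are not specified here, so this should be read as the textbook definition rather than the authors' implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Because Python strings iterate over Unicode code points, each tashkeel mark counts as its own character in this CER, so a dropped or wrong diacritic is penalized like any other character error. Under this definition, a WER of 0.160 corresponds to roughly 16 word-level edits per 100 reference words.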
Problem

Research questions and friction points this paper is trying to address.

Overcoming Arabic script complexities for accurate OCR
Enhancing diacritical mark recognition in Arabic texts
Improving OCR performance on low-resolution Arabic documents
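
To make the diacritics problem concrete, the sketch below counts and strips tashkeel from a string. It assumes the eight core marks in the Unicode range U+064B to U+0652 (fathatan through sukun); the helper names are hypothetical, and the paper may treat a different or wider set of marks.

```python
import re

# Assumption: "tashkeel" here means the core Arabic marks U+064B..U+0652
# (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun).
TASHKEEL = re.compile("[\u064B-\u0652]")


def strip_tashkeel(text: str) -> str:
    """Return the text with the core tashkeel marks removed."""
    return TASHKEEL.sub("", text)


def tashkeel_ratio(text: str) -> float:
    """Fraction of code points in the text that are tashkeel marks."""
    if not text:
        return 0.0
    return len(TASHKEEL.findall(text)) / len(text)


# Fully vocalized "kataba": three letters, each followed by a fatha.
word = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_tashkeel(word)
ratio = tashkeel_ratio(word)
```

In this fully vocalized word, half of the code points are diacritics, which illustrates why an OCR system that ignores tashkeel can score poorly on character-level metrics for diacritically rich text.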
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Model Adaptation
Iterative fine-tuning on synthetic datasets
Superior handling of tashkeel and fonts