🤖 AI Summary
Arabic OCR faces inherent challenges, including cursive script, diacritical marks (tashkeel), and font and layout variability. To address these, we introduce QARI, a series of open-source, Arabic-specific multimodal OCR models built upon Qwen2-VL-2B-Instruct. Our method pioneers an iterative, synthetic-data-driven fine-tuning paradigm integrating vision–language joint alignment, Arabic-customized tokenization, and rule-aware post-processing. QARI v0.2 achieves state-of-the-art performance among open-source models on diacritically rich text, with WER = 0.160, CER = 0.061, and BLEU = 0.737, demonstrating substantial improvements in tashkeel recognition accuracy and robustness to low-resolution inputs. Moreover, exploratory work with QARI v0.3 shows potential for document structure understanding and handwritten Arabic text recognition. All models, training code, and the synthetic dataset are fully open-sourced to foster reproducibility and community advancement.
📝 Abstract
The inherent complexities of Arabic script, namely its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state of the art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside strong performance on low-resolution images. Further explorations (QARI v0.3) show strong potential for structural document understanding and handwritten text recognition. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
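For readers unfamiliar with the WER and CER figures reported above, the following is a minimal sketch of how such metrics are commonly computed: the Levenshtein edit distance between reference and OCR output, divided by the reference length in words or characters. The function names `edit_distance`, `wer`, and `cer` are illustrative, and the sketch assumes no text normalization; the evaluation tooling and normalization actually used for Qari-OCR are not described in this section.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / reference words (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Illustrative strings: kaf, ta, ba, each followed by a fatha (U+064E),
# versus the same word with all three diacritics dropped.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E"
hypothesis = "\u0643\u062A\u0628"
print(cer(reference, hypothesis))  # 0.5 (3 deleted marks / 6 code points)
print(wer(reference, hypothesis))  # 1.0 (the single word no longer matches)
```

Because Python strings are sequences of Unicode code points, each tashkeel mark counts as its own character here, so a dropped diacritic raises CER and turns the whole word into a WER error. This is one reason diacritically rich text is a demanding benchmark.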