KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

📅 2025-02-20

🤖 AI Summary

Arabic OCR faces distinctive challenges, including cursive script, right-to-left text flow, and complex typographic and calligraphic features, while existing evaluation frameworks leave gaps in PDF-to-Markdown conversion, numeral recognition, word elongation, and table structure detection. To address this, we introduce KITAB-Bench, a comprehensive Arabic OCR and document understanding benchmark comprising 8,809 samples across 9 major domains and 36 sub-domains. Coverage includes handwritten text, structured tables, PDF-to-Markdown conversion, and 21 chart types for business intelligence. We evaluate vision-language models (VLMs such as GPT-4, Gemini, and Qwen) against traditional OCR engines (such as EasyOCR, PaddleOCR, and Surya). VLMs outperform the traditional engines by an average of 60% in Character Error Rate (CER). Even so, the best model on PDF-to-Markdown conversion, Gemini-2.0-Flash, reaches only 65% accuracy, which exposes a significant bottleneck in Arabic document understanding. This work establishes a rigorous evaluation framework and identifies concrete directions for improving Arabic document analysis.

📝 Abstract

With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
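For readers unfamiliar with the headline metric, Character Error Rate is conventionally the character-level edit distance between the model output and the ground truth, divided by the ground-truth length. The sketch below illustrates that standard definition only; the function names `levenshtein` and `cer` are hypothetical, and any text normalization the benchmark applies before scoring is not described here and is not reproduced.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, substitution."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ch_a == ch_b else 1)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length.

    Assumption: no normalization (diacritics, whitespace, Unicode form)
    is applied; a real evaluation pipeline may normalize first.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One dropped character out of four reference characters.
    print(cer("كتاب", "كتب"))
```

Because the score is normalized by the reference length, it can exceed 1.0 when the hypothesis is much longer than the ground truth, and lower values are better.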

Problem

Research questions and friction points this paper is trying to address.

Develops benchmark for Arabic OCR challenges.

Evaluates modern models on diverse Arabic documents.

Identifies gaps in Arabic PDF-to-Markdown conversion.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Arabic OCR benchmark

Vision-language models

Multi-domain document analysis
