SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition

📅 2025-05-30

📈 Citations: 0

✨ Influential: 0

career value

183K/year

🤖 AI Summary

Arabic OCR suffers from a scarcity of large-scale, structured, and typographically realistic book-level datasets; existing resources are predominantly limited to word- or line-level samples with insufficient font and layout diversity. To address this, we introduce SARD—the first large-scale synthetic Arabic OCR dataset explicitly designed for book-level typography—comprising 843,000 document images and 690 million words, spanning 10 fonts and accurately modeling Arabic ligature formation, bidirectional text flow, and hierarchical page layouts. SARD uniquely unifies scale, font variability, book-level layout complexity, and controllable synthesis while eliminating scanner noise, thereby enabling joint layout-content modeling. Benchmarking on SARD demonstrates substantial improvements in page-level recognition robustness across mainstream OCR models—including traditional engines and vision-language models—with average line-level accuracy rising by 12.7% over prior benchmarks.

Technology Category

Application Category

📝 Abstract

Arabic Optical Character Recognition (OCR) is essential for converting vast amounts of Arabic print media into digital formats. However, training modern OCR models, especially powerful vision-language models, is hampered by the lack of large, diverse, and well-structured datasets that mimic real-world book layouts. Existing Arabic OCR datasets often focus on isolated words or lines or are limited in scale, typographic variety, or structural complexity found in books. To address this significant gap, we introduce SARD (Large-Scale Synthetic Arabic OCR Dataset). SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents. It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage. Unlike datasets derived from scanned documents, SARD is free from real-world noise and distortions, offering a clean and controlled environment for model training. Its synthetic nature provides unparalleled scalability and allows for precise control over layout and content variation. We detail the dataset's composition and generation process and provide benchmark results for several OCR models, including traditional and deep learning approaches, highlighting the challenges and opportunities presented by this dataset. SARD serves as a valuable resource for developing and evaluating robust OCR and vision-language models capable of processing diverse Arabic book-style texts.

Problem

Research questions and friction points this paper is trying to address.

Lack of large diverse datasets for Arabic OCR training

Existing datasets lack book-style layout complexity

Need for clean scalable synthetic Arabic text data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic dataset for Arabic OCR

Large-scale book-style text simulation

Clean controlled training environment

🔎 Similar Papers

Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition