🤖 AI Summary
Historical handwritten Arabic manuscripts, written in a cursive script, are difficult to recognize automatically, and high-quality annotated datasets for them are scarce. Method: This work introduces Muharaf, a machine learning dataset of more than 1,600 historical Arabic manuscript page images covering diverse document types (e.g., personal letters, diaries, poems, church records, legal correspondence). Each page is transcribed by experts in archival Arabic and annotated with text-line polygon coordinates and basic page-element labels. The paper describes the data acquisition pipeline, notable dataset features, and statistics, and trains convolutional neural networks on the data to obtain a preliminary baseline. Contribution/Results: Muharaf pairs expert transcriptions with fine-grained spatial annotations, giving the Handwritten Text Recognition (HTR) community a benchmark resource for historical Arabic manuscripts. It is intended to advance HTR for Arabic manuscripts and for cursive text in general, and the preliminary baseline gives later work on historical document digitization a reference point.
📝 Abstract
We present the Manuscripts of Handwritten Arabic (Muharaf) dataset, a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. The Muharaf dataset includes diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondence. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks using this data.
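To make the idea of text-line polygon annotations concrete, the following is a minimal sketch of how one annotated line might be represented and reduced to a bounding box for cropping. The `TextLine` record, its field names, and the sample coordinates are illustrative assumptions and do not reflect the dataset's actual file format or schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextLine:
    """Hypothetical record for one annotated text line (not the dataset's real schema)."""
    polygon: List[Point]  # polygon vertices in image pixel coordinates, in order
    text: str             # transcription of the line


def bounding_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Made-up example: a slightly slanted line region with placeholder text.
line = TextLine(
    polygon=[(10, 20), (210, 18), (212, 60), (12, 64)],
    text="مثال",  # placeholder word ("example"), not taken from the dataset
)
box = bounding_box(line.polygon)
```

A polygon follows a slanted or curved handwritten line more tightly than a rectangle, so a recognizer can mask out neighboring lines inside the crop. The bounding box is only a convenient crop window derived from the polygon.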