Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition

📅 2024-06-13

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Historical handwritten Arabic manuscripts, particularly those in cursive scripts, are difficult to recognize automatically, and high-quality annotated datasets are scarce. Method: This work introduces Muharaf, a dataset of more than 1,600 historical Arabic manuscript page images drawn from diverse document types (e.g., personal letters, diaries, poems, church records, and legal correspondences). Each page is transcribed by experts in archival Arabic and annotated with polygon coordinates for its text lines and with basic page elements. The paper describes the data acquisition pipeline and trains convolutional neural networks to obtain a preliminary baseline. Contribution/Results: Muharaf provides expert transcriptions with fine-grained spatial annotations, giving the handwritten text recognition (HTR) community a resource for Arabic manuscripts and for cursive text in general, along with dataset statistics and a preliminary baseline for future comparison.

📝 Abstract

We present the Manuscripts of Handwritten Arabic (Muharaf) dataset, which is a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. The Muharaf dataset includes diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondences. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks using this data.
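
To make the idea of text-line polygon coordinates concrete, the sketch below shows how a single line polygon could be reduced to a bounding box (for cropping) and an enclosed area (for sanity checks). The point-list representation, the sample coordinates, and the helper names `bounding_box` and `polygon_area` are assumptions made for illustration; the summary above does not specify how the dataset stores its annotations.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates; an assumed layout


def bounding_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical text-line polygon: a slightly slanted line of handwriting.
line_polygon = [(100, 50), (900, 60), (900, 120), (100, 110)]

box = bounding_box(line_polygon)
area = polygon_area(line_polygon)
print(box, area)
```

A polygon follows a slanted or curved handwritten line more tightly than its bounding box does, so comparing the polygon area with the box area gives a rough measure of how much background a rectangular crop would include.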

Problem

Research questions and friction points this paper is trying to address.

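Builds a dataset for handwritten text recognition on historical Arabic manuscripts.

Aims to advance recognition of cursive text in general, not only Arabic.

Covers diverse handwriting styles and document types, from personal letters to legal correspondences.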

Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-transcribed dataset of more than 1,600 historic page images

Polygon coordinates for text lines and basic page elements

Preliminary baseline from convolutional neural networks trained on the data

🔎 Similar Papers

HATFormer: Historic Handwritten Arabic Text Recognition with Transformers