BullingerDB: A Dataset for Handwritten Text Recognition and Writer Retrieval

📅 2026-05-28
🤖 AI Summary
This study addresses the lack of large-scale benchmark datasets for historical handwritten text recognition and writer retrieval that combine multilingual content, a long time span, and rich metadata. The authors introduce BullingerDB, a dataset derived from the 16th-century correspondence of Heinrich Bullinger (1504-1575). It comprises 20,898 pages and 499,222 text lines written by 796 writers over six decades, mostly in Latin and Early New High German, and is annotated with writer identity and time. For text recognition, TrOCR is the best-performing model evaluated, with a character error rate (CER) of 9.1%. For writer retrieval, the authors introduce a temporal nDCG metric to assess time-aware retrieval and report a mean average precision (mAP) of 78.3%. The results indicate that temporally coherent retrieval is achievable, while the mAP points to the difficulty posed by long-term variation in handwriting style.
📝 Abstract
We present BullingerDB, a large-scale benchmark dataset for historical document analysis based on the correspondence of Heinrich Bullinger (1504-1575). The corpus comprises 20,898 pages and 499,222 text lines written by 796 writers over six decades, featuring stylistic variation, multilingual content (mostly Latin and Early New High German), and meta-information such as writer identity and time. We evaluate BullingerDB on text recognition and writer retrieval. TrOCR, the best-performing model, achieves a CER of 9.1%. For writer retrieval, we introduce a temporal nDCG metric to assess time-aware retrieval. While temporally coherent retrieval is achievable, the mAP of 78.3% indicates challenges due to long-term stylistic variation. With BullingerDB, we aim to establish a new benchmark for multilingual historical text recognition and temporally aware writer analysis.
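The two headline numbers, CER and mAP, are standard metrics, and a minimal sketch of how they are commonly computed may help when reading the results. The code below is an illustration only, not the authors' evaluation code: the function names are hypothetical, CER is computed at corpus level (total edits over total reference characters), and average precision assumes every relevant document appears somewhere in the ranked list. The temporal nDCG metric is not sketched here because the abstract does not define it.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER (assumed convention): total edits / total reference chars."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """AP for one query; ranked_relevance[k] is True if the item at rank k+1
    was written by the query's writer. Assumes all relevant items are ranked."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries: Sequence[Sequence[bool]]) -> float:
    """mAP: mean of per-query average precision."""
    return sum(average_precision(q) for q in queries) / len(queries)
```

For instance, a line transcribed as "gracia" against the reference "gratia" has one substitution in six characters, so `cer(["gratia"], ["gracia"])` is 1/6. A reported CER of 9.1% corresponds to roughly nine character edits per hundred reference characters under this convention.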
Problem

Research questions and friction points this paper is trying to address.

handwritten text recognition
writer retrieval
historical documents
multilingual
temporal analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

historical document analysis
handwritten text recognition
writer retrieval
temporal nDCG
multilingual dataset