AI Summary
Handwritten text recognition on Bullinger's 16th-century letters suffers from annotation errors caused by hyphenation artifacts.
Method: We propose a CTC-based self-training approach that aligns text line images with full transcriptions using dynamic programming and the output probabilities of the recognition model. Observing that weaker models yield more accurate alignments, we design an iterative self-training strategy that improves recognition and alignment accuracy together. The method relies on a model trained with the CTC loss and uses the PyLaia framework for text-line recognition and evaluation.
Contribution/Results: Experiments show a 1.1 percentage-point reduction in character error rate (CER) with PyLaia, together with improved alignment accuracy. We publicly release a new subset of 100 manually corrected pages from the Bullinger dataset, along with source code and benchmarks, to support reproducible research on handwritten text recognition for historical documents.
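For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate and a percentage-point change can be computed. It assumes the common definition (total character-level edit distance divided by total reference length); the function names `edit_distance`, `cer`, and `percentage_point_change` are illustrative and are not taken from the released code or from PyLaia.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(references: list[str], hypotheses: list[str]) -> float:
    """Character error rate over a set of lines, as a fraction in [0, inf)."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return errors / chars


def percentage_point_change(cer_before: float, cer_after: float) -> float:
    """Absolute CER difference in percentage points (positive = improvement)."""
    return 100.0 * (cer_before - cer_after)
```

A percentage-point reduction is an absolute difference between two CER values, not a relative one, which is why the helper subtracts the rates instead of dividing them.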
Abstract
Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. In this work, we address annotation errors, particularly hyphenation issues, in the Bullinger correspondence, a large 16th-century letter collection. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and the output probabilities of a model trained with the CTC loss. Our approach improves recognition performance (e.g., by 1.1 percentage points CER with PyLaia) and increases alignment accuracy. Interestingly, we find that weaker models yield more accurate alignments, enabling an iterative training strategy. We release a new manually corrected subset of 100 pages from the Bullinger dataset, along with our code and benchmarks. Our approach can be applied iteratively to further improve the CER as well as the alignment quality for text recognition pipelines. Code and data are available via https://github.com/andreas-fischer-unifr/nntp.
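The alignment idea described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes each text line image has already been passed through a CTC-trained model to obtain a matrix of per-frame log-probabilities, that the blank symbol has index 0, and that line breaks fall only on word boundaries (so it does not model the hyphenation cases the paper targets). The names `ctc_log_likelihood`, `align_lines`, and the `encode` callback are hypothetical.

```python
import numpy as np

BLANK = 0  # assumption: index of the CTC blank symbol


def ctc_log_likelihood(log_probs: np.ndarray, labels: list[int]) -> float:
    """log P(labels | line image) under CTC, via the forward algorithm.

    log_probs: (T, V) per-frame log-probabilities from a CTC-trained model.
    labels:    label indices of the candidate text (no blanks).
    """
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]  # interleave blanks: _ l1 _ l2 _ ...
    S = len(ext)
    alpha = np.full(S, -np.inf)
    alpha[0] = log_probs[0, BLANK]
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]
    for t in range(1, log_probs.shape[0]):
        prev = alpha
        alpha = np.full(S, -np.inf)
        for s in range(S):
            cands = [prev[s]]
            if s >= 1:
                cands.append(prev[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                cands.append(prev[s - 2])
            alpha[s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    if S == 1:
        return float(alpha[-1])
    return float(np.logaddexp(alpha[-1], alpha[-2]))


def align_lines(line_log_probs, words, encode):
    """Split `words` into one consecutive chunk per line, maximizing the
    summed CTC log-likelihood with dynamic programming.

    line_log_probs: list of (T_i, V) arrays, one per text line image.
    words:          the full transcription as a list of words.
    encode:         maps a string to a list of label indices.
    """
    n, m = len(line_log_probs), len(words)
    best = np.full((n + 1, m + 1), -np.inf)  # best[i, j]: first j words on first i lines
    back = np.zeros((n + 1, m + 1), dtype=int)
    best[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(i, m + 1):       # every line receives at least one word
            for k in range(i - 1, j):   # words[k:j] go on line i
                if not np.isfinite(best[i - 1, k]):
                    continue
                text = " ".join(words[k:j])
                score = best[i - 1, k] + ctc_log_likelihood(
                    line_log_probs[i - 1], encode(text))
                if score > best[i, j]:
                    best[i, j], back[i, j] = score, k
    if not np.isfinite(best[n, m]):
        raise ValueError("no feasible alignment for these lines and words")
    chunks, j = [], m
    for i in range(n, 0, -1):           # trace the best split points back
        k = back[i, j]
        chunks.append(words[k:j])
        j = k
    return chunks[::-1]
```

This sketch evaluates the CTC likelihood for every candidate span, which costs on the order of n·m² forward passes for n lines and m words; a practical system would restrict the candidate spans or reuse partial computations. In a self-training loop, the aligned line-level transcriptions would then serve as training targets for the next model iteration.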