CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality

Date: 2025-08-11
AI Summary
Handwritten text recognition for the 16th-century Bullinger letters suffers from annotation errors, many of them caused by hyphenation artifacts. Method: We propose a CTC-based self-training approach that aligns full transcriptions with text-line images using dynamic programming over model output probabilities. Observing that weaker models produce more accurate alignments, we design an iterative self-training strategy that jointly improves recognition and alignment accuracy. The method employs a sequence-to-sequence model trained with CTC loss and uses the PyLaia framework for text-line recognition and evaluation. Contribution/Results: Experiments show a 1.1 percentage-point reduction in character error rate (CER) with PyLaia, along with improved alignment accuracy. We publicly release a new subset of 100 manually corrected pages together with the source code, supporting reproducible research on historical handwritten text recognition.
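The character error rate mentioned in the summary is conventionally the character-level edit distance between reference and hypothesis, divided by the number of reference characters. The following is a minimal sketch of that convention; the function names `edit_distance` and `cer` are illustrative and are not taken from the authors' code or from PyLaia.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return errors / chars


# A stray hyphen left in a line label counts as one character error.
score = cer(["Bullinger"], ["Bullin-ger"])
```

Under this definition, a hyphenation artifact in the ground truth penalizes a correct prediction, which is one reason cleaner line-level labels can lower the measured CER.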

Abstract
Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. In this work, we address annotation errors, particularly hyphenation issues, in the Bullinger correspondence, a large 16th-century letter collection. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and the output probabilities of a model trained with the CTC loss. Our approach improves performance (e.g., by 1.1 percentage points CER with PyLaia) and increases alignment accuracy. Interestingly, we find that weaker models yield more accurate alignments, enabling an iterative training strategy. We release a new manually corrected subset of 100 pages from the Bullinger dataset, along with our code and benchmarks. Our approach can be applied iteratively to further improve the CER as well as the alignment quality for text recognition pipelines. Code and data are available via https://github.com/andreas-fischer-unifr/nntp.
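To illustrate how dynamic programming over CTC output probabilities can score a candidate transcription against a text line, here is a minimal Viterbi-style sketch. It is not the authors' released implementation. The function names, the use of integer label ids with blank index 0, and the toy probabilities are all assumptions made for the example.

```python
import math

NEG_INF = float("-inf")


def ctc_viterbi_score(log_probs, target, blank=0):
    """Best-path CTC log-score of `target` (a list of label ids).

    log_probs[t][k] is the log-probability of label k at frame t (T >= 1).
    """
    # Interleave blanks: blank, c1, blank, c2, ..., blank
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)
    prev = [NEG_INF] * S
    prev[0] = log_probs[0][ext[0]]
    if S > 1:
        prev[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        cur = [NEG_INF] * S
        for s in range(S):
            best = prev[s]                      # stay on the same symbol
            if s >= 1:
                best = max(best, prev[s - 1])   # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                best = max(best, prev[s - 2])
            if best > NEG_INF:
                cur[s] = best + log_probs[t][ext[s]]
        prev = cur
    # A valid path ends on the last label or on the trailing blank.
    return max(prev[-1], prev[-2]) if S > 1 else prev[-1]


def best_candidate(log_probs, candidates, blank=0):
    """Pick the candidate label sequence with the highest alignment score."""
    return max(candidates, key=lambda c: ctc_viterbi_score(log_probs, c, blank))


# Toy example: labels 0 = blank, 1 = "a", 2 = "b"; three frames reading "a", blank, "b".
frames = [[math.log(p) for p in row] for row in
          [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]]
chosen = best_candidate(frames, [[1, 2], [2, 1]])
```

In a line-matching setting, the candidates could be alternative segments of the full transcription, for example with and without a hyphenated word fragment, and the highest-scoring segment would be assigned to the line image.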
Problem

Research questions and friction points this paper is trying to address.

Addressing annotation errors in historical handwritten documents
Improving alignment accuracy for text line images
Reducing hyphenation issues in 16th-century letter collections
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-training method with CTC alignment algorithm
Dynamic programming for transcription-image matching
Iterative training strategy with weaker models
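The iterative strategy listed above can be pictured as a loop that alternates between training a recognizer on the current line labels and re-aligning the full transcription with that recognizer. The sketch below shows only this control flow under stated assumptions: `train` and `align` are hypothetical callables supplied by the caller, and the stopping rule (labels no longer change) is an illustrative choice rather than something stated in the paper.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")


def iterative_self_training(
    line_images: List[str],
    initial_labels: List[str],
    full_transcription: str,
    train: Callable[[List[str], List[str]], Model],
    align: Callable[[Model, List[str], str], List[str]],
    rounds: int = 3,
) -> Tuple[Model, List[str], List[List[str]]]:
    """Alternate training and re-alignment for at most `rounds` iterations."""
    labels = list(initial_labels)
    history = [labels]
    model = None
    for _ in range(rounds):
        # Train a recognizer on the current (possibly noisy) line labels.
        model = train(line_images, labels)
        # Re-derive line labels by aligning the full transcription to the lines.
        new_labels = align(model, line_images, full_transcription)
        history.append(new_labels)
        if new_labels == labels:  # labels are stable, nothing left to correct
            break
        labels = new_labels
    return model, labels, history


# Dummy stand-ins, for demonstration only: "training" counts the labels and
# "alignment" splits the transcription on whitespace.
model, labels, history = iterative_self_training(
    line_images=["line1.png", "line2.png"],
    initial_labels=["ab-", "cd"],
    full_transcription="ab cd",
    train=lambda images, labs: len(labs),
    align=lambda m, images, text: text.split(),
)
```

Because the paper reports that weaker models yield more accurate alignments, a real `train` callable might deliberately limit the training budget in each round; that choice is left to the caller here.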
Authors

Marco Peer
HEIA-FR
Document Analysis, Computer Vision, Machine Learning
Anna Scius-Bertrand
University of Applied Sciences and Arts Western Switzerland, Fribourg, Switzerland
Andreas Fischer
University of Fribourg, Switzerland