CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality

📅 2025-08-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Handwritten text recognition for the 16th-century Bullinger letters suffers from annotation errors, particularly hyphenation artifacts. Method: We propose a CTC-based self-training approach that aligns full transcriptions with text line images using dynamic programming over the output probabilities of a model trained with the CTC loss. Observing that weaker models yield more accurate alignments, we apply training and alignment iteratively to improve both recognition and alignment accuracy. Contribution/Results: Experiments show a reduction in character error rate (CER) of 1.1 percentage points with PyLaia, together with higher alignment accuracy. We publicly release a new subset of 100 manually corrected pages from the Bullinger dataset, along with our code and benchmarks.
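For readers unfamiliar with the metric, the character error rate can be sketched as the character-level edit distance divided by the number of reference characters. This is a minimal illustration of the standard definition, not the evaluation code used in the paper; the function names `edit_distance` and `cer` are assumptions made for this example.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r
                curr[j - 1] + 1,         # insert h
                prev[j - 1] + (r != h),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# A stray hyphen left in a reference line counts as one character error.
print(cer(["Bullin-"], ["Bullin"]))
```

Under this definition, a gain of 1.1 percentage points means the ratio drops by 0.011 in absolute terms.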

📝 Abstract

Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. In this work, we address annotation errors, particularly hyphenation issues, in the Bullinger correspondence, a large 16th-century letter collection. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and the output probabilities of a model trained with the CTC loss. Our approach improves recognition performance (e.g., by 1.1 percentage points CER with PyLaia) and increases alignment accuracy. Interestingly, we find that weaker models yield more accurate alignments, enabling an iterative training strategy. We release a new manually corrected subset of 100 pages from the Bullinger dataset, along with our code and benchmarks. Our approach can be applied iteratively to further improve both the CER and the alignment quality of text recognition pipelines. Code and data are available via https://github.com/andreas-fischer-unifr/nntp.
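The core idea of aligning a transcription to model outputs with dynamic programming can be sketched with the standard CTC Viterbi (forced) alignment. The example below is a minimal illustration under stated assumptions, not the authors' implementation: it aligns a single label sequence to the frames of one line, whereas the paper matches full transcriptions to text line images. The function name `ctc_forced_align`, the convention that index 0 is the blank label, and the toy probabilities are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def ctc_forced_align(
    log_probs: Sequence[Sequence[float]],  # shape [T][V], per-frame log-probabilities
    target: Sequence[int],                 # label indices of the transcription
    blank: int = 0,                        # assumed index of the CTC blank label
) -> Tuple[float, List[int]]:
    """Return the best log-score and the label emitted at each frame."""
    # Extended sequence: blank, c1, blank, c2, ..., blank
    ext = [blank]
    for c in target:
        ext += [c, blank]
    T, S = len(log_probs), len(ext)

    dp = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    dp[0][0] = log_probs[0][ext[0]]
    if S > 1:
        dp[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            cands = [(dp[t - 1][s], s)]                # stay on the same symbol
            if s >= 1:
                cands.append((dp[t - 1][s - 1], s - 1))  # advance by one
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))
            best, arg = max(cands)
            dp[t][s] = best + log_probs[t][ext[s]]
            back[t][s] = arg

    # A valid path ends on the last label or on the trailing blank.
    ends = [S - 1] + ([S - 2] if S >= 2 else [])
    s = max(ends, key=lambda k: dp[T - 1][k])
    score = dp[T - 1][s]
    if score == NEG_INF:
        raise ValueError("target cannot be aligned to this many frames")

    path = [blank] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        if t > 0:
            s = back[t][s]
    return score, path


# Toy example: vocabulary {0: blank, 1: 'a', 2: 'b'}, four frames, target "ab".
probs = [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]
log_probs = [[math.log(p) for p in frame] for frame in probs]
score, path = ctc_forced_align(log_probs, [1, 2])
print(path)  # [1, 0, 2, 2]
```

The per-frame path gives the frame range covered by each character, and the score indicates how well a candidate transcription fits the image, which is the kind of signal an alignment step can use to decide where a line's text starts and ends.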

Problem

Research questions and friction points this paper is trying to address.

Addressing annotation errors in historical handwritten documents

Improving alignment accuracy for text line images

Reducing hyphenation issues in 16th-century letter collections

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-training method with CTC alignment algorithm

Dynamic programming for transcription-image matching

Iterative training strategy with weaker models
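The iterative strategy listed above can be sketched as a loop that alternates between training a recognizer on the current line labels and re-aligning those labels with the trained model. This is a hypothetical outline only: the `train` and `realign` callables, the `(image_id, text)` label format, and the stopping rule are assumptions for illustration and are not taken from the paper or its code.

```python
from typing import Any, Callable, List, Tuple

LineLabel = Tuple[str, str]  # (hypothetical image id, transcription of that line)


def self_train(
    labels: List[LineLabel],
    train: Callable[[List[LineLabel]], Any],
    realign: Callable[[Any, List[LineLabel]], List[LineLabel]],
    rounds: int = 3,
) -> Tuple[List[LineLabel], List[int]]:
    """Alternate training and re-alignment; return final labels and change counts."""
    history: List[int] = []
    current = list(labels)
    for _ in range(rounds):
        model = train(current)                 # fit a recognizer on current labels
        updated = realign(model, current)      # re-align transcriptions to lines
        changed = sum(a != b for a, b in zip(current, updated))
        history.append(changed)
        current = updated
        if changed == 0:                       # labels are stable, stop early
            break
    return current, history


# Toy stand-ins: "training" does nothing and "re-alignment" drops trailing hyphens.
toy_labels = [("line-1", "Bullin-"), ("line-2", "ger")]
final, history = self_train(
    toy_labels,
    train=lambda labels: None,
    realign=lambda model, labels: [(i, t.rstrip("-")) for i, t in labels],
)
print(final, history)
```

In a real pipeline, `train` would fit a CTC-based recognizer such as one built with PyLaia, and `realign` would run a CTC alignment of the full transcription against each line image.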

🔎 Similar Papers

WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting