OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches

📅 2025-02-03
🤖 AI Summary
This study addresses post-OCR error correction for historical documents, systematically evaluating the generalization and practicality of open-source large language models (LLaMA, Phi series) in bilingual English–Finnish settings. We propose a joint optimization framework tailored to historical texts, incorporating dynamic input segmentation, 4-bit quantization, and context-aware continuation strategies. Crucially, we identify, for the first time, the performance bottlenecks of LLMs in low-resource language OCR correction, specifically for Finnish. Experimental results show a significant CER reduction on English, validating the method's efficacy; however, Finnish performance remains below practical thresholds due to scarce linguistic resources, training data distribution shift, and historical text characteristics, including archaic orthography and layout-induced noise. The work provides a reproducible technical pathway for multilingual historical document proofreading and delivers a critical, causally grounded analysis of cross-lingual LLM limitations in heritage text processing.

📝 Abstract
Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
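The abstract reports results as character error rate (CER). CER is commonly computed as the character-level edit distance between a system output and the ground-truth text, divided by the length of the ground truth. The following is a minimal sketch under that common definition; the exact evaluation script and any text normalization used in the study are not given here, and the function names `levenshtein` and `cer` are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Edit distance to the reference, normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


if __name__ == "__main__":
    reference = "historical"   # hypothetical ground truth
    ocr_output = "histor1cal"  # hypothetical OCR output with one wrong character
    corrected = "historical"   # hypothetical post-corrected output
    print(cer(ocr_output, reference), cer(corrected, reference))
```

Under this definition, post-correction helps only when the CER of the corrected text is lower than the CER of the raw OCR output against the same reference.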
Problem

Research questions and friction points this paper is trying to address.

Language Models
OCR Error Correction
Historical Documents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale Language Models
OCR Error Correction
Historical Document Processing
Jenna Kanerva
Department of Computing, University of Turku
Natural Language Processing, Machine Learning
Cassandra Ledins
TurkuNLP, Department of Computing, University of Turku, Finland
Siiri Kapyaho
TurkuNLP, Department of Computing, University of Turku, Finland
Filip Ginter
University of Turku
language technology, natural language processing