CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

📅 2024-08-30

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Problem: OCR of historical newspapers and periodicals is error-prone, in part because of their complex layouts. Method: This paper proposes Context Leveraging OCR Correction (CLOCR-C), a post-OCR correction approach that uses the infilling and context-adaptive abilities of transformer-based language models (LMs) and supplies socio-cultural context in the prompt. It evaluates seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. Contributions/Results: The authors release a dataset of 91 transcribed NCSE articles totalling 40 thousand words. The top-performing model achieves over a 60% reduction in character error rate on NCSE, and the corrections carry over to downstream Named Entity Recognition, as measured by increased Cosine Named Entity Similarity. Prompts that provide socio-cultural context improve performance, while misleading prompts lower it.
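The summary describes prompts that do or do not carry socio-cultural context. The paper's actual prompt wording is not reproduced on this page, so the following is only a minimal sketch of the idea: the function name `build_correction_prompt`, the instruction text, and the example context strings are all hypothetical.

```python
from typing import Optional


def build_correction_prompt(ocr_text: str, context: Optional[str] = None) -> str:
    """Assemble a post-OCR correction prompt (hypothetical wording, not the paper's)."""
    parts = []
    if context:
        # Optional socio-cultural context, e.g. the publication type and period.
        parts.append(f"Context: {context}")
    parts.append(
        "Correct the OCR errors in the text below. Return only the corrected text."
    )
    parts.append(f"Text:\n{ocr_text}")
    return "\n\n".join(parts)


ocr = "Tbe Qneen arrivcd in Lond0n"

# Three prompt variants of the kind the summary contrasts (contexts are made up).
no_context = build_correction_prompt(ocr)
with_context = build_correction_prompt(
    ocr, "The text is from a 19th-century English newspaper."
)
misleading = build_correction_prompt(
    ocr, "The text is from a modern software manual."
)
```

Each variant would be sent to the same LM and the outputs scored against a ground-truth transcription, which isolates the effect of the context line.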

📝 Abstract

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction, whether the corrections improve downstream NLP tasks, and what value providing socio-cultural context adds to the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.
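To make the headline metric concrete, the sketch below computes character error rate (CER) as Levenshtein edit distance divided by the length of the reference transcription, then the relative reduction between raw and corrected OCR. This is the common definition of CER and is assumed here; the paper's exact normalisation may differ, and the example strings are invented for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis measured against reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def error_reduction(cer_before: float, cer_after: float) -> float:
    """Relative reduction in CER; 0.6 corresponds to a 60% reduction."""
    if cer_before == 0:
        raise ValueError("cer_before must be non-zero")
    return (cer_before - cer_after) / cer_before


# Invented example: ground truth, raw OCR, and an LM-corrected version.
truth = "The Queen arrived in London"
raw_ocr = "Tbe Qneen arrivcd in Lond0n"
corrected = "The Queen arrived in Londen"

reduction = error_reduction(cer(truth, raw_ocr), cer(truth, corrected))
print(f"relative CER reduction: {reduction:.0%}")
```

Note that the reduction is relative to the starting error rate, so a 60% reduction on a noisy corpus can still leave a substantial absolute CER.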

Problem

Research questions and friction points this paper is trying to address.

Optical Character Recognition

Complex Typography

Historical Documents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Smart Language Models

Historical Information Utilization

Complex Layout OCR Improvement
