🤖 AI Summary
Problem: OCR of degraded printed documents suffers from poor accuracy. Degradation varies widely across document collections (high inter-domain variability), while character shapes within a single document are highly redundant (low intra-domain variability), a redundancy that existing methods fail to exploit effectively.
Method: We propose an unsupervised iterative error-correction framework. Its core innovation is the first systematic modeling of intra-document character shape redundancy, realized through a joint optimization pipeline integrating an extended Gaussian Mixture Model (GMM), intra-class re-alignment, and statistical normality testing, all driven by the EM algorithm for iterative clustering and recognition refinement.
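For reference, the EM step inside such a pipeline follows the standard Gaussian mixture updates. These are the textbook equations in standard mixture notation, not the paper's extended formulation: the E-step computes responsibilities

$$\gamma_{ik} = \frac{\pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

and the M-step re-estimates each component from them:

$$N_k = \sum_i \gamma_{ik}, \qquad \mu_k = \frac{1}{N_k}\sum_i \gamma_{ik}\, x_i, \qquad \Sigma_k = \frac{1}{N_k}\sum_i \gamma_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^\top, \qquad \pi_k = \frac{N_k}{N}$$

The paper's extension interleaves these updates with intra-cluster re-alignment and normality testing rather than running them to convergence in isolation.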
Results: The method significantly improves OCR accuracy on documents with multiple levels of degradation. It has been successfully deployed in digitizing recovered Uruguayan military archives and European historical newspapers (17th to mid-20th century), demonstrating strong effectiveness and generalizability for real-world historical document OCR correction.
📝 Abstract
Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document's redundancy. We propose an unsupervised method that leverages the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and to suggest better clusterings. To this aim, we introduce an extended Gaussian Mixture Model (GMM), alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and statistical normality testing. We demonstrate improvements on documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
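The alternation described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `cluster_glyphs`, the split-by-adding-a-component refinement, and the normality check (projecting each cluster onto its first principal axis and applying D'Agostino's test) are simplifying assumptions, and the intra-cluster re-alignment step is omitted entirely.

```python
# Illustrative sketch only: alternate GMM/EM clustering of character-shape
# feature vectors with a per-cluster statistical normality test, growing the
# number of components whenever a cluster looks like a mixture of shapes.
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def cluster_glyphs(X, n_init_clusters=2, alpha=1e-3, max_rounds=5, seed=0):
    """Cluster glyph feature vectors X of shape (n_samples, n_features)."""
    k = n_init_clusters
    for _ in range(max_rounds):
        # EM step: fit a k-component Gaussian mixture to the glyph vectors.
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        labels = gmm.predict(X)
        # Normality check: project each cluster onto its first principal
        # axis and test the resulting 1-D sample for Gaussianity.
        split = False
        for c in range(k):
            members = X[labels == c]
            if len(members) < 20:  # too few samples for a reliable test
                continue
            centered = members - members.mean(axis=0)
            _, _, vt = np.linalg.svd(centered, full_matrices=False)
            proj = centered @ vt[0]
            if stats.normaltest(proj).pvalue < alpha:
                split = True  # cluster likely mixes two character shapes
        if not split:
            break
        k += 1  # refine: re-run EM with one more component
    return gmm, labels
```

On synthetic data with two well-separated blobs, starting from a single component, the normality test fails on the merged cluster and the loop grows the mixture until each cluster is plausibly Gaussian.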