Improving OCR using internal document redundancy

📅 2025-08-20
🤖 AI Summary
Problem: OCR accuracy drops on low-quality printed documents. Within a single document, character shapes vary little (low intra-domain variability), while degradation varies widely from one document to another (high inter-domain variability). Existing OCR methods do not fully exploit this within-document redundancy. Method: We propose an unsupervised method that corrects the imperfect output of a given OCR system by leveraging the redundancy of character shapes within a document, and that also suggests a better clustering. It relies on an extended Gaussian Mixture Model (GMM) that alternates an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and statistical normality testing. Results: The method improves OCR output on documents with various levels of degradation, including recovered Uruguayan military archives and European newspapers from the 17th to the mid-20th century.

📝 Abstract
Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document's redundancy. We propose an unsupervised method by leveraging the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and suggest better clustering. To this aim, we introduce an extended Gaussian Mixture Model (GMM) by alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and normality statistical testing. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
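To make the idea of correcting OCR output from within-document redundancy concrete, here is a minimal sketch. It is not the paper's extended GMM: it fits a plain spherical Gaussian mixture with EM, using one component per OCR label, and omits the realignment and normality-testing steps. The function name `em_relabel`, the 2-D feature vectors, and the toy data are assumptions made for illustration only.

```python
import math


def em_relabel(features, ocr_labels, n_iter=10, min_var=1e-3):
    """Refine OCR labels with a spherical Gaussian mixture (illustrative only).

    features   : list of equal-length numeric tuples, one per glyph (hypothetical)
    ocr_labels : initial label of each glyph, as produced by some OCR system
    """
    labels = sorted(set(ocr_labels))
    n, dim = len(features), len(features[0])
    # Initial hard responsibilities come from the OCR output itself.
    resp = [[1.0 if ocr_labels[i] == c else 0.0 for c in labels] for i in range(n)]

    for _ in range(n_iter):
        # M-step: weight, mean and isotropic variance of each component.
        params = []
        for k in range(len(labels)):
            nk = sum(r[k] for r in resp) + 1e-12
            mean = [sum(resp[i][k] * features[i][d] for i in range(n)) / nk
                    for d in range(dim)]
            var = sum(
                resp[i][k] * sum((features[i][d] - mean[d]) ** 2 for d in range(dim))
                for i in range(n)
            ) / (nk * dim)
            params.append((nk / n, mean, max(var, min_var)))

        # E-step: posterior probability of each component for each glyph.
        for i, x in enumerate(features):
            logp = []
            for weight, mean, var in params:
                sq = sum((x[d] - mean[d]) ** 2 for d in range(dim))
                logp.append(math.log(weight)
                            - 0.5 * dim * math.log(2 * math.pi * var)
                            - 0.5 * sq / var)
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp[i] = [v / total for v in unnorm]

    # Each glyph takes the label of its most probable component.
    return [labels[max(range(len(labels)), key=lambda k: resp[i][k])]
            for i in range(n)]


# Toy data: the fifth glyph looks like the "e" group but was read as "c".
features = [(0.0, 0.1), (0.1, 0.0), (-0.1, 0.0), (0.0, -0.1), (0.05, 0.05),
            (5.0, 5.1), (5.1, 5.0), (4.9, 5.0), (5.0, 4.9)]
ocr_labels = ["e", "e", "e", "e", "c", "c", "c", "c", "c"]
corrected = em_relabel(features, ocr_labels)
```

In this toy case the misread glyph sits much closer to the "e" cluster, so the mixture reassigns it. Real glyph features would be high-dimensional, and the covariance model would need more care than the isotropic variance assumed here.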
Problem

Research questions and friction points this paper is trying to address.

Leveraging document redundancy to improve OCR accuracy
Correcting imperfect OCR outputs using unsupervised methods
Enhancing character recognition in degraded historical documents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised method leveraging document character redundancy
Extended Gaussian Mixture Model with EM algorithm
Intra-cluster realignment and statistical normality testing
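The two auxiliary steps named above can be sketched as follows. The summary does not say how realignment or normality testing is carried out, so both functions are stand-ins: `realign` does a brute-force search over small integer translations against a cluster template, and `jarque_bera` computes the classical Jarque-Bera statistic as one possible normality check. All names and the toy inputs are hypothetical.

```python
def shift(glyph, dy, dx):
    """Translate a 2-D glyph (list of rows) by (dy, dx), padding with zeros."""
    h, w = len(glyph), len(glyph[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = glyph[sy][sx]
    return out


def sq_dist(a, b):
    """Sum of squared pixel differences between two same-size glyphs."""
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def realign(glyph, template, max_shift=1):
    """Return the shifted copy of glyph closest to template, plus the shift used."""
    candidates = [(dy, dx)
                  for dy in range(-max_shift, max_shift + 1)
                  for dx in range(-max_shift, max_shift + 1)]
    best = min(candidates, key=lambda s: sq_dist(shift(glyph, *s), template))
    return shift(glyph, *best), best


def jarque_bera(values):
    """Jarque-Bera statistic; large values suggest the sample is not Gaussian."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)


# A vertical bar one pixel to the right of the cluster template.
template = [[0, 1, 0, 0]] * 4
glyph = [[0, 0, 1, 0]] * 4
aligned, used_shift = realign(glyph, template)

# A two-mode sample, as might arise when a cluster mixes two characters.
jb = jarque_bera([-1.0] * 50 + [1.0] * 50)
```

Under the Gaussian hypothesis the Jarque-Bera statistic is asymptotically chi-squared with two degrees of freedom, so a value above roughly 5.99 rejects normality at the 5% level. A cluster that fails such a test is a natural candidate for splitting, though whether the paper proceeds this way is not stated here.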
Authors

Diego Belzarena
Universidad de la República, Uruguay
Computer vision, electrical engineering

Seginus Mowlavi
Université Paris-Saclay, ENS Paris-Saclay, Centre Borelli, Gif-sur-Yvette, France

Aitor Artola
Postdoc at City University of Hong Kong
Machine learning, image processing, signal processing

Camilo Mariño
Université Paris-Saclay, ENS Paris-Saclay, Centre Borelli, Gif-sur-Yvette, France

Marina Gardella
Centre Borelli, ENS Paris-Saclay, Université Paris-Saclay
Image processing

Ignacio Ramírez
IIE, Facultad de Ingeniería, Universidad de la República, Uruguay

Antoine Tadros
Determinant France, Paris, France

Roy He
City University of Hong Kong
Applied mathematics, image processing, numerical analysis, deep learning theory

Natalia Bottaioli
Université Paris-Saclay, ENS Paris-Saclay, Centre Borelli, Gif-sur-Yvette, France

Boshra Rajaei
Sadjad University, Mashhad, Iran

Gregory Randall
Professor, Instituto de Ingeniería Eléctrica, Universidad de la República
Segmentation, image processing, computer vision

Jean-Michel Morel
City University of Hong Kong, Kowloon, Hong Kong