🤖 AI Summary
Problem: OCR of degraded printed documents suffers from poor accuracy. Degradation varies widely across document collections (high inter-domain variability), while character shapes within a single document are highly redundant (low intra-domain variability), a redundancy that existing methods fail to exploit effectively.
Method: We propose an unsupervised iterative error-correction framework. Its core innovation is the first systematic modeling of intra-document character shape redundancy, realized through a joint optimization pipeline integrating an extended Gaussian Mixture Model (GMM), intra-class re-alignment, and statistical normality testing, all driven by the EM algorithm for iterative clustering and recognition refinement.
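For reference, the EM step inside such a pipeline follows the standard Gaussian mixture updates. These are the textbook equations in standard mixture notation, not the paper's extended formulation: the E-step computes responsibilities

$$\gamma_{ik} = \frac{\pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

and the M-step re-estimates each component from them:

$$N_k = \sum_i \gamma_{ik}, \qquad \mu_k = \frac{1}{N_k}\sum_i \gamma_{ik}\, x_i, \qquad \Sigma_k = \frac{1}{N_k}\sum_i \gamma_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^\top, \qquad \pi_k = \frac{N_k}{N}$$

The paper's extension interleaves these updates with intra-cluster re-alignment and normality testing rather than running them to convergence in isolation.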
Results: The method significantly improves OCR accuracy on documents with multiple levels of degradation. It has been successfully deployed in digitizing recovered Uruguayan military archives and European historical newspapers (17th to mid-20th century), demonstrating strong effectiveness and generalizability for real-world historical document OCR correction.
📝 Abstract
Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document's redundancy. We propose an unsupervised method that leverages the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and to suggest better clusterings. To this aim, we introduce an extended Gaussian Mixture Model (GMM), alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and statistical normality testing. We demonstrate improvements on documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
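The alternation described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `cluster_glyphs`, the split-by-adding-a-component refinement, and the normality check (projecting each cluster onto its first principal axis and applying D'Agostino's test) are simplifying assumptions, and the intra-cluster re-alignment step is omitted entirely.

```python
# Illustrative sketch only: alternate GMM/EM clustering of character-shape
# feature vectors with a per-cluster statistical normality test, growing the
# number of components whenever a cluster looks like a mixture of shapes.
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def cluster_glyphs(X, n_init_clusters=2, alpha=1e-3, max_rounds=5, seed=0):
    """Cluster glyph feature vectors X of shape (n_samples, n_features)."""
    k = n_init_clusters
    for _ in range(max_rounds):
        # EM step: fit a k-component Gaussian mixture to the glyph vectors.
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        labels = gmm.predict(X)
        # Normality check: project each cluster onto its first principal
        # axis and test the resulting 1-D sample for Gaussianity.
        split = False
        for c in range(k):
            members = X[labels == c]
            if len(members) < 20:  # too few samples for a reliable test
                continue
            centered = members - members.mean(axis=0)
            _, _, vt = np.linalg.svd(centered, full_matrices=False)
            proj = centered @ vt[0]
            if stats.normaltest(proj).pvalue < alpha:
                split = True  # cluster likely mixes two character shapes
        if not split:
            break
        k += 1  # refine: re-run EM with one more component
    return gmm, labels
```

On synthetic data with two well-separated blobs, starting from a single component, the normality test fails on the merged cluster and the loop grows the mixture until each cluster is plausibly Gaussian.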