AI Summary
To address low OCR accuracy caused by degradation in historical document images, this paper proposes a two-stage end-to-end optimization framework. In the first stage, a U-Net-based image restoration model, trained on a synthetically generated multi-degradation dataset, jointly optimizes visual clarity and linguistic consistency; a multi-directional patch extraction and fusion mechanism lets it process large-format documents. In the second stage, a semantic-aware ByT5 model performs post-OCR error correction on the recognized text. The key innovations are the first joint optimization of image restoration quality and text semantic consistency, and the construction of the first cross-lingual (English/French/Spanish) synthetic dataset for historical text. Evaluated on 13,831 pages of real historical documents, the framework reduces the character error rate by 63.9-70.3% relative to baseline OCR on raw images.
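The multi-directional patch extraction and fusion step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it splits a large page into overlapping fixed-size patches, runs a (here: identity) restoration function on each, and fuses the outputs by averaging overlapping regions; the patch size, stride, and function names are assumptions for illustration.

```python
import numpy as np

def extract_patches(img, patch=64, stride=48):
    """Slide an overlapping window over the image; extra offsets at the
    right/bottom edges ensure every pixel is covered."""
    H, W = img.shape
    ys = list(range(0, max(H - patch, 0) + 1, stride))
    xs = list(range(0, max(W - patch, 0) + 1, stride))
    if ys[-1] + patch < H:
        ys.append(H - patch)
    if xs[-1] + patch < W:
        xs.append(W - patch)
    return [(y, x, img[y:y + patch, x:x + patch]) for y in ys for x in xs]

def fuse_patches(patches, shape):
    """Average overlapping (restored) patches back into one full image."""
    acc = np.zeros(shape, dtype=np.float64)
    cnt = np.zeros(shape, dtype=np.float64)
    for y, x, p in patches:
        ph, pw = p.shape
        acc[y:y + ph, x:x + pw] += p
        cnt[y:y + ph, x:x + pw] += 1
    return acc / np.maximum(cnt, 1)

# With an identity "restoration" model, fusion reproduces the input exactly.
img = np.random.rand(200, 300)
restored = fuse_patches(extract_patches(img), img.shape)
assert np.allclose(restored, img)
```

In the actual pipeline, each patch would be passed through the trained restoration network before fusion; averaging the overlaps smooths seams between adjacent patches.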
Abstract
This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to improve text extraction from degraded historical documents. Our key innovation lies in jointly optimizing image clarity and linguistic consistency. First, we generate synthetic image pairs with randomized text fonts, layouts, and degradations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-corrector, fine-tuned on synthetic historical text training pairs, addresses any remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
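The headline metric, character error rate (CER), is the character-level edit distance between the OCR output and the reference transcription, divided by the reference length. A minimal sketch of the standard definition (the paper's exact evaluation settings may differ):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / len(ref),
    computed with a rolling one-row dynamic-programming table."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # edit distances for the empty-reference row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(cer("hello", "helo"))  # one deletion over five characters -> 0.2
```

A relative reduction of 63.9-70.3% means, for example, that a raw-image CER of 0.25 would drop to roughly 0.07-0.09 after the full pipeline (illustrative numbers, not figures from the paper).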