AI Summary
To address low OCR accuracy caused by degradation in historical document images, this paper proposes a two-stage end-to-end optimization framework. In the first stage, a U-Net-based image restoration model, trained on a synthetically generated multi-degradation dataset, jointly optimizes visual clarity and linguistic consistency; a multi-directional patch extraction and fusion mechanism lets it process large-format documents. In the second stage, a semantic-aware ByT5 model performs post-OCR error correction on the recognized text. The key innovations are the first joint optimization of image restoration quality and text semantic consistency, and the construction of the first cross-lingual (English/French/Spanish) synthetic dataset for historical text. Evaluated on 13,831 pages of real historical documents, the framework reduces the character error rate by 63.9-70.3% relative to baseline OCR on raw images.
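The multi-directional patch extraction and fusion step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it splits a large page into overlapping fixed-size patches, runs a (here: identity) restoration function on each, and fuses the outputs by averaging overlapping regions; the patch size, stride, and function names are assumptions for illustration.

```python
import numpy as np

def extract_patches(img, patch=64, stride=48):
    """Slide an overlapping window over the image; extra offsets at the
    right/bottom edges ensure every pixel is covered."""
    H, W = img.shape
    ys = list(range(0, max(H - patch, 0) + 1, stride))
    xs = list(range(0, max(W - patch, 0) + 1, stride))
    if ys[-1] + patch < H:
        ys.append(H - patch)
    if xs[-1] + patch < W:
        xs.append(W - patch)
    return [(y, x, img[y:y + patch, x:x + patch]) for y in ys for x in xs]

def fuse_patches(patches, shape):
    """Average overlapping (restored) patches back into one full image."""
    acc = np.zeros(shape, dtype=np.float64)
    cnt = np.zeros(shape, dtype=np.float64)
    for y, x, p in patches:
        ph, pw = p.shape
        acc[y:y + ph, x:x + pw] += p
        cnt[y:y + ph, x:x + pw] += 1
    return acc / np.maximum(cnt, 1)

# With an identity "restoration" model, fusion reproduces the input exactly.
img = np.random.rand(200, 300)
restored = fuse_patches(extract_patches(img), img.shape)
assert np.allclose(restored, img)
```

In the actual pipeline, each patch would be passed through the trained restoration network before fusion; averaging the overlaps smooths seams between adjacent patches.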
Abstract
This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to improve text extraction from degraded historical documents. Our key innovation lies in jointly optimizing image clarity and linguistic consistency. First, we generate synthetic image pairs with randomized text fonts, layouts, and degradations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-corrector, fine-tuned on synthetic historical text training pairs, addresses any remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
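The headline metric, character error rate (CER), is the character-level edit distance between the OCR output and the reference transcription, divided by the reference length. A minimal sketch of the standard definition (the paper's exact evaluation settings may differ):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / len(ref),
    computed with a rolling one-row dynamic-programming table."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # edit distances for the empty-reference row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(cer("hello", "helo"))  # one deletion over five characters -> 0.2
```

A relative reduction of 63.9-70.3% means, for example, that a raw-image CER of 0.25 would drop to roughly 0.07-0.09 after the full pipeline (illustrative numbers, not figures from the paper).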