🤖 AI Summary
This study addresses the challenge of recognizing text in documents degraded by damage, occlusion, or missing content by proposing the first end-to-end unified framework for document text restoration. The approach integrates optical character recognition (OCR), occlusion detection, masked language modeling, and diffusion-based image inpainting to reconstruct text with consistent semantics and visual style. Key contributions include OPRB, a large-scale synthetic dataset of 30,078 degraded document images, and UCSM, a unified evaluation metric that jointly considers edit distance, semantic coherence, and contextual predictability. Experimental results demonstrate that the proposed method substantially improves text restoration quality, establishing a new benchmark and offering a practical tool for digital archiving and historical document preservation.
📝 Abstract
In document understanding, reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem, even though downstream document understanding tasks can benefit from a reconstruction step. In response, this paper presents a novel unified pipeline that combines state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degraded regions with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module then reintegrates the restored text into the image, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), which combines edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. The OPRB dataset and code are available on Hugging Face (https://huggingface.co/datasets/kpurkayastha/OPRB) and GitHub (https://github.com/kunalpurkayastha/DocRevive), respectively.
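The abstract names the ingredients of UCSM (edit, semantic, and length similarities plus a contextual predictability penalty) but not how they are combined. The following is a minimal sketch of one way such a score could be put together, not the authors' definition. The function names, the equal weights, and the penalty form are all assumptions made for illustration. The semantic similarity and the predictability value are taken as precomputed inputs in [0, 1], since the abstract does not say which models produce them.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(pred: str, ref: str) -> float:
    """Edit distance normalized to [0, 1], where 1 means identical strings."""
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1.0 - levenshtein(pred, ref) / longest


def length_similarity(pred: str, ref: str) -> float:
    """Ratio of the shorter length to the longer length."""
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else min(len(pred), len(ref)) / longest


def ucsm_sketch(pred: str, ref: str, semantic_sim: float,
                predictability: float,
                weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Hypothetical UCSM-style score; weights and penalty are assumptions.

    semantic_sim: precomputed similarity in [0, 1], e.g. from sentence embeddings.
    predictability: how obvious the reference is from context, in [0, 1].
    """
    w_edit, w_sem, w_len = weights
    base = (w_edit * edit_similarity(pred, ref)
            + w_sem * semantic_sim
            + w_len * length_similarity(pred, ref))
    # Assumed penalty: a shortfall costs more when the correct text was obvious.
    penalty = predictability * (1.0 - base)
    return max(0.0, base - penalty)


if __name__ == "__main__":
    # Same wrong restoration, scored under low and high contextual predictability.
    print(ucsm_sketch("cot", "cat", semantic_sim=0.5, predictability=0.1))
    print(ucsm_sketch("cot", "cat", semantic_sim=0.5, predictability=0.9))
```

Under these assumptions, a perfect restoration scores 1 regardless of predictability, and the same error is scored lower when the reference text was highly predictable from its context, which mirrors the behavior the abstract describes.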