DocRevive: A Unified Pipeline for Document Text Restoration

📅 2026-04-11

🤖 AI Summary

This study addresses text recognition in documents degraded by damage, occlusion, or missing content, and proposes an end-to-end unified framework for document text restoration. The approach combines optical character recognition (OCR), occlusion detection, masked language modeling, and diffusion-based image inpainting so that the reconstructed text is semantically consistent with its context and visually consistent with the surrounding font, size, and alignment. Key contributions include OPRB, a synthetic dataset of 30,078 degraded document images, and UCSM, a unified evaluation metric that combines edit, semantic, and length similarity with a contextual predictability measure. The authors report improved text restoration quality and position the work as a benchmark and a practical tool for digital archiving and historical document preservation.
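
The stage order described above (recognize text, flag occluded regions, predict the missing text, render it back into the image) can be sketched as a small orchestration function. This is only an illustrative skeleton, not the authors' implementation: `Region`, `restore_document`, and every `toy_*` stand-in are hypothetical names, and the toy "image" is a dictionary of box-to-text entries rather than pixels. In a real system each callable would wrap an OCR engine, an occlusion detector, a masked language model, and a diffusion-based inpainting model.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), hypothetical layout
MASK = "[MASK]"                  # placeholder token for occluded text


@dataclass(frozen=True)
class Region:
    box: Box
    text: str
    occluded: bool = False


def restore_document(
    image: Any,
    ocr: Callable[[Any], List[Region]],
    is_occluded: Callable[[Any, Box], bool],
    fill_masks: Callable[[List[str]], List[str]],
    render: Callable[[Any, Region], Any],
) -> Tuple[Any, List[Region]]:
    """Run the four stages in order; every stage is injected (assumed API)."""
    # 1. Detect and recognize text regions.
    regions = ocr(image)
    # 2. Flag regions the occlusion detector considers degraded.
    flagged = [replace(r, occluded=is_occluded(image, r.box)) for r in regions]
    # 3. Mask degraded text and let a language model predict it from context.
    tokens = [MASK if r.occluded else r.text for r in flagged]
    filled = fill_masks(tokens)
    restored = [replace(r, text=t) for r, t in zip(flagged, filled)]
    # 4. Render only the repaired regions back into the image.
    for r in restored:
        if r.occluded:
            image = render(image, r)
    return image, restored


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_ocr(image):
    return [Region(box, text) for box, text in image.items()]


def toy_is_occluded(image, box):
    return "#" in image[box]  # '#' marks damaged characters in the toy page


def toy_fill_masks(tokens):
    return ["brown" if t == MASK else t for t in tokens]


def toy_render(image, region):
    updated = dict(image)  # copy, so the original page is left untouched
    updated[region.box] = region.text
    return updated


page = {(0, 0, 10, 5): "quick", (12, 0, 22, 5): "br###", (24, 0, 30, 5): "fox"}
restored_page, regions = restore_document(
    page, toy_ocr, toy_is_occluded, toy_fill_masks, toy_render
)
```

Passing the stages in as callables keeps the orchestration independent of any particular model, which makes it easy to swap one component while holding the others fixed.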

📝 Abstract

In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module seamlessly reintegrates text, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), incorporating edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. The OPRB dataset is available on Hugging Face (https://huggingface.co/datasets/kpurkayastha/OPRB) and the code on GitHub (https://github.com/kunalpurkayastha/DocRevive).
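
To make the idea behind UCSM concrete, the sketch below combines edit, semantic, and length similarity into a weighted base score and then penalizes it more heavily when the correct text is highly predictable from context. The abstract does not give the formula, so everything here is an assumption for illustration: the weights, the Jaccard token overlap used as a stand-in for semantic similarity, the penalty form, and the function name `ucsm_sketch`. The `predictability` argument is assumed to be a value in [0, 1], for example a language model's probability of the reference text given its context.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(pred: str, ref: str) -> float:
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1.0 - levenshtein(pred, ref) / longest


def length_similarity(pred: str, ref: str) -> float:
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else min(len(pred), len(ref)) / longest


def semantic_similarity(pred: str, ref: str) -> float:
    # Stand-in only: Jaccard overlap of lowercase tokens. A real metric
    # would more plausibly compare sentence embeddings.
    a, b = set(pred.lower().split()), set(ref.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ucsm_sketch(pred: str, ref: str, predictability: float,
                weights=(0.4, 0.4, 0.2)) -> float:
    """Hypothetical UCSM-style score in [0, 1]; not the paper's formula."""
    w_edit, w_sem, w_len = weights  # assumed weights, chosen to sum to 1
    base = (w_edit * edit_similarity(pred, ref)
            + w_sem * semantic_similarity(pred, ref)
            + w_len * length_similarity(pred, ref))
    base = min(1.0, max(0.0, base))
    # Assumed penalty: the more obvious the correct text, the more a
    # shortfall from a perfect base score is punished.
    penalty = predictability * (1.0 - base)
    return base * (1.0 - penalty)
```

With this form, a perfect reconstruction scores 1 regardless of predictability, and an imperfect one scores lower when the reference text was contextually obvious than when it was genuinely ambiguous.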

Problem

Research questions and friction points this paper is trying to address.

document text restoration

damaged text

occluded text

incomplete text

document understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

document text restoration

diffusion-based inpainting

masked language modeling