DREAM: Document Reconstruction via End-to-end Autoregressive Model

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Document reconstruction faces two key challenges: error propagation in multi-stage pipelines, and the inability of existing end-to-end generative models to preserve element layout information. To address these, we propose DREAM, an end-to-end autoregressive model that converts a text image into a single reconstruction sequence covering text, tables, mathematical formulas, and layout elements. Because the whole sequence is produced by one model rather than assembled from separate subtask predictions with heuristic rules, error propagation is mitigated and layout information is retained. We further introduce a standardized task definition, the DocRec1K benchmark dataset, and a dedicated evaluation metric, the Document Similarity Metric (DSM). Experiments show that DREAM achieves state-of-the-art performance on document reconstruction and competitive results on subtasks including document layout analysis, text recognition, table structure recognition, formula recognition, and reading order detection.

📝 Abstract
Document reconstruction is an important part of document analysis and recognition, and it has attracted growing interest from the research community. Many existing approaches run separate document understanding models on individual subtasks and then merge their predictions into a reconstructed document using heuristic rules. These multi-stage pipelines suffer from error propagation, which degrades performance. More recent studies use generative models to extract the logical sequence of plain text, tables, and mathematical expressions end to end, but this approach fails to preserve element layout information, which is vital for document reconstruction. To overcome these limitations, we present Document Reconstruction via End-to-end Autoregressive Model (DREAM), an autoregressive model designed specifically for document reconstruction. DREAM converts a text image into a document reconstruction sequence in a single end-to-end process and captures a broader range of document element information. In addition, we establish a standardized definition of the document reconstruction task and introduce a new Document Similarity Metric (DSM) and the DocRec1K dataset for evaluating performance on the task. Experimental results show that our method achieves state-of-the-art performance in document reconstruction. Results on a range of subtasks, including document layout analysis, text recognition, table structure recognition, formula recognition, and reading order detection, further indicate that our model is competitive and compatible with various tasks.
Problem

Research questions and friction points this paper is trying to address.

Overcoming error propagation in multi-stage document reconstruction methods
Preserving layout information in end-to-end document reconstruction
Standardizing evaluation for document reconstruction performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end autoregressive model for document reconstruction
Standardized task definition with Document Similarity Metric
Comprehensive document element sequence generation
Authors

Xin Li
Tencent YouTu Lab
Mingming Gong
University of Melbourne & Mohamed bin Zayed University of Artificial Intelligence
Causal Inference, Machine Learning, Computer Vision
Yunfei Wu
Tencent YouTu Lab
Jianxin Dai
Tencent YouTu Lab
Antai Guo
Tencent YouTu Lab
Xinghua Jiang
Tencent YouTu Lab
Haoyu Cao
Tencent YouTu Lab
Yinsong Liu
Tencent YouTu Lab
Deqiang Jiang
Tencent YouTu Lab
Xing Sun
Tencent YouTu Lab
LLM, MLLM, Agent