Relation-Rich Visual Document Generator for Visual Information Extraction

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual information extraction (VIE) from relation-rich visual documents is hindered by layout diversity and scarce annotated data. Method: the paper proposes a two-stage content-layout joint modeling framework: (1) a large language model generates hierarchically structured text that captures entity categories and relations; (2) a layout generation network, trained solely on OCR outputs, synthesizes diverse, content-aware document layouts. Contribution/Results: the approach removes the reliance on handcrafted templates and rules, and bridges the gap between layout topology and textual content that limits prior layout generators. It synthesizes documents with high layout diversity, strong content-layout consistency, and faithful relational structure, without any human annotations. Extensive experiments demonstrate significant performance gains across multiple VIE benchmarks, establishing a scalable, annotation-free paradigm for relation-aware document synthesis.
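To make the two-stage flow above concrete, here is a minimal, runnable Python toy. It is a sketch under stated assumptions: the real stage 1 uses an LLM to write the content and the real stage 2 uses a learned layout network, whereas this mock-up hard-codes one question-answer pair and places boxes with a trivial stacking rule. All names (Entity, stage1_generate_content, stage2_generate_layout) are illustrative, not the RIDGE API.

```python
# Toy two-stage content/layout pipeline; illustrative only, not RIDGE's code.
from dataclasses import dataclass, field

@dataclass
class Entity:
    text: str                                        # content of the entity
    category: str                                    # e.g. "question", "answer", "header"
    links: list[int] = field(default_factory=list)   # indices of related entities
    bbox: tuple | None = None                        # (x0, y0, x1, y1), set by stage 2

def stage1_generate_content() -> list[Entity]:
    """Stage 1 stand-in: in the paper an LLM produces hierarchically
    structured text; here we hard-code one question-answer pair."""
    return [
        Entity("Date of Birth:", "question", links=[1]),
        Entity("1990-05-17", "answer"),
    ]

def stage2_generate_layout(entities: list[Entity]) -> list[Entity]:
    """Stage 2 stand-in: the paper learns layouts from OCR results;
    this toy just stacks boxes top-to-bottom, sized by text length."""
    y = 0
    for e in entities:
        w, h = 8 * len(e.text), 16        # crude width/height estimate
        e.bbox = (0, y, w, y + h)
        y += h + 4
    return entities

for e in stage2_generate_layout(stage1_generate_content()):
    print(e.category, e.bbox, e.text)
```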

📝 Abstract
Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates, or adopt rule-based approaches that limit layout diversity. Moreover, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between contents and layouts. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format that captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotation effort. Experimental results demonstrate that our method significantly enhances the performance of document understanding models on various VIE benchmarks. The code and model will be available at https://github.com/AI-Application-and-Integration-Lab/RIDGE.
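One way to see why OCR results alone can suffice as supervision (the abstract's claim that no human labeling is needed): every OCR record already pairs a text span with a bounding box, so content-to-layout training pairs come for free. The sketch below is hypothetical; the record schema is an assumption, not any particular OCR engine's output format.

```python
# Hypothetical illustration of OCR output as free supervision for layout
# generation. The dict schema and the sample values are assumptions.
ocr_lines = [
    {"text": "Invoice No:",   "bbox": (40, 32, 148, 50)},
    {"text": "INV-2024-0091", "bbox": (160, 32, 300, 50)},
    {"text": "Total Due:",    "bbox": (40, 64, 132, 82)},
]

# Content-driven layout generation can then be framed as: given the texts,
# predict the boxes.
inputs  = [line["text"] for line in ocr_lines]   # model inputs (content)
targets = [line["bbox"] for line in ocr_lines]   # model targets (layout)
print(list(zip(inputs, targets)))
```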
Problem

Research questions and friction points this paper is trying to address.

Extracting visual information from relation-rich documents with diverse layouts
Overcoming data scarcity in synthetic document generation without manual templates
Generating documents with complex content-layout associations, learning layouts from OCR results alone
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Structure Text for content generation (see the mock-up after this list)
Content-driven Layout Generation from OCR
Two-stage approach for diverse document creation
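The Hierarchical Structure Text mock-up referenced in the first item above: a rough, hypothetical illustration of how document content annotated with entity categories and question-answer relations might be nested. The actual HST format is defined by the paper; every field name here is an assumption.

```python
# Hypothetical mock-up of hierarchically structured content with entity
# categories and relations; field names are illustrative only, not the
# paper's Hierarchical Structure Text specification.
hst_example = {
    "doc_type": "registration form",
    "sections": [
        {
            "header": "Applicant Information",
            "pairs": [  # each pair encodes a question -> answer relation
                {"question": "Full Name:", "answer": "Jane Doe"},
                {"question": "Phone:", "answer": "+886-2-1234-5678"},
            ],
        },
    ],
}
print(hst_example["sections"][0]["pairs"][0])
```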
👥 Authors
Zi-Han Jiang
National Taiwan University
Chien-Wei Lin
National Taiwan University
Wei-Hua Li
National Taiwan University
Hsuan-Tung Liu
E.SUN Financial Holding Co., Ltd.
Yi-Ren Yeh
National Kaohsiung Normal University
Machine Learning, Computer Vision, Manifold Learning
Chu-Song Chen
National Taiwan University
deep learning, pattern recognition, computer vision, image processing, multimedia