🤖 AI Summary
Existing OCR and layout analysis methods struggle to preserve semantic coherence while enabling efficient multimodal RAG for visually rich documents, where text and imagery are tightly coupled. To address this, we propose SCAN, a semantic layout analysis framework designed for vision-language models (VLMs). SCAN introduces a novel coarse-grained, VLM-friendly semantic partitioning paradigm: by fine-tuning a vision-based object detector, it jointly optimizes semantic consistency and spatial contiguity to adaptively segment documents into functionally coherent, granularity-appropriate regions. This unifies text- and vision-based RAG—previously optimized in isolation—enabling joint multimodal retrieval-augmented generation. Evaluated on bilingual (English and Japanese) benchmarks, SCAN achieves end-to-end improvements of up to 9.0% in textual RAG accuracy and up to 6.4% in visual RAG accuracy over state-of-the-art open-source and commercial baselines.
📝 Abstract
With the increasing adoption of Large Language Models (LLMs) and Vision-Language Models (VLMs), rich document analysis technologies for applications like Retrieval-Augmented Generation (RAG) and visual RAG are gaining significant attention. Recent research indicates that using VLMs can achieve better RAG performance, but processing rich documents remains a challenge since a single page contains large amounts of information. In this paper, we present SCAN (**S**emanti**C** Document Layout **AN**alysis), a novel approach enhancing both textual and visual RAG systems working with visually rich documents. It is a VLM-friendly approach that identifies document components with appropriate semantic granularity, balancing context preservation with processing efficiency. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering continuous components. We trained the SCAN model by fine-tuning object detection models with sophisticated annotation datasets. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.0% and visual RAG performance by up to 6.4%, outperforming conventional approaches and even commercial document processing solutions.