🤖 AI Summary
Existing OCR and layout analysis methods struggle to preserve semantic coherence while enabling efficient multimodal RAG for visually rich documents, where text and imagery are tightly coupled. To address this, we propose SCAN, a semantic layout analysis framework designed for vision-language models (VLMs). SCAN introduces a novel coarse-grained, VLM-friendly semantic partitioning paradigm: by fine-tuning a vision-based object detector, it jointly optimizes semantic consistency and spatial contiguity to adaptively segment documents into functionally coherent, granularity-appropriate regions. This unifies text- and vision-based RAG—previously optimized in isolation—enabling joint multimodal retrieval-augmented generation. Evaluated on bilingual (English and Japanese) benchmarks, SCAN achieves end-to-end improvements of up to 9.0% in textual RAG accuracy and up to 6.4% in visual RAG accuracy over state-of-the-art open-source and commercial baselines.
📝 Abstract
With the increasing adoption of Large Language Models (LLMs) and Vision-Language Models (VLMs), rich document analysis technologies for applications like Retrieval-Augmented Generation (RAG) and visual RAG are gaining significant attention. Recent research indicates that using VLMs can achieve better RAG performance, but processing rich documents remains a challenge since a single page contains large amounts of information. In this paper, we present SCAN (**S**emanti**C** Document Layout **AN**alysis), a novel approach enhancing both textual and visual RAG systems working with visually rich documents. It is a VLM-friendly approach that identifies document components with appropriate semantic granularity, balancing context preservation with processing efficiency. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering continuous components. We trained the SCAN model by fine-tuning object detection models with sophisticated annotation datasets. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.0% and visual RAG performance by up to 6.4%, outperforming conventional approaches and even commercial document processing solutions.