🤖 AI Summary
Existing PDF parsers often miss critical visual content, extract irrelevant images, and fail to associate figures with their captions reliably. These limitations significantly degrade multimodal retrieval-augmented generation (RAG) systems. To address them, this work proposes a lightweight, production-oriented visual PDF parsing framework that combines spatial heuristics, document layout analysis, and semantic similarity to achieve robust, low-latency detection of visual elements and precise figure–caption alignment. Evaluated on both public and internal datasets, the method attains visual element detection accuracy of at least 96% and caption association accuracy of 93%. When deployed as a RAG preprocessing module, it significantly outperforms state-of-the-art parsers and large vision-language models while reducing latency by more than 2×.
📝 Abstract
PDF documents contain critical visual elements such as figures, tables, and forms whose accurate extraction is essential for document understanding and multimodal retrieval-augmented generation (RAG). Existing PDF parsers often miss complex visuals, extract non-informative artifacts (e.g., watermarks, logos), produce fragmented elements, and fail to reliably associate captions with their corresponding elements, which degrades downstream retrieval and question answering. We present a lightweight, production-level PDF parsing framework that accurately detects visual elements and associates captions with them using a combination of spatial heuristics, layout analysis, and semantic similarity. On popular benchmark datasets and internal product data, the proposed solution achieves $\geq 96\%$ visual element detection accuracy and $93\%$ caption association accuracy. When used as a preprocessing step for multimodal RAG, it significantly outperforms state-of-the-art parsers and large vision-language models on both internal data and the MMDocRAG benchmark, while reducing latency by over $2\times$. We have deployed the proposed system in a challenging production environment.
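The abstract does not spell out how spatial and semantic cues are combined, so the following is only a minimal illustrative sketch of one way caption association could work, not the authors' implementation. It assumes each figure and caption comes with a bounding box (y growing downward), that some text is available for the figure (for example, labels extracted from inside its region), and that a bag-of-words cosine stands in for a real embedding similarity. All names (`Box`, `spatial_score`, `token_cosine`, `associate_caption`), the `weight` of 0.6, and the `min_score` threshold are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Box:
    """Axis-aligned bounding box in page coordinates (y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float


def vertical_gap(a: Box, b: Box) -> float:
    """Vertical distance between two boxes; 0 if they overlap vertically."""
    if b.y0 >= a.y1:
        return b.y0 - a.y1
    if a.y0 >= b.y1:
        return a.y0 - b.y1
    return 0.0


def horizontal_overlap(a: Box, b: Box) -> float:
    """Horizontal overlap as a fraction of the narrower box's width."""
    overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
    narrower = min(a.x1 - a.x0, b.x1 - b.x0)
    return max(0.0, overlap) / narrower if narrower > 0 else 0.0


def spatial_score(figure: Box, caption: Box, page_height: float) -> float:
    """Higher when the caption is vertically close to and aligned with the figure."""
    proximity = 1.0 - min(1.0, vertical_gap(figure, caption) / page_height)
    return proximity * horizontal_overlap(figure, caption)


def token_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (assumed stand-in for embeddings)."""
    ta = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    tb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(ta[t] * tb[t] for t in ta)
    norm = math.sqrt(sum(v * v for v in ta.values())) * math.sqrt(
        sum(v * v for v in tb.values())
    )
    return dot / norm if norm > 0 else 0.0


def associate_caption(
    figure: Box,
    figure_text: str,
    captions: List[Tuple[Box, str]],
    page_height: float,
    weight: float = 0.6,      # hypothetical weight on the spatial cue
    min_score: float = 0.3,   # hypothetical rejection threshold
) -> Optional[int]:
    """Return the index of the best-scoring caption, or None if none qualifies."""
    best_index, best_score = None, min_score
    for i, (box, text) in enumerate(captions):
        score = weight * spatial_score(figure, box, page_height) + (
            1.0 - weight
        ) * token_cosine(figure_text, text)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


if __name__ == "__main__":
    fig = Box(100, 100, 400, 300)
    caps = [
        (Box(100, 310, 400, 330), "Figure 1: Revenue growth by quarter"),
        (Box(100, 600, 400, 620), "Table 2: Error rates by model"),
    ]
    print(associate_caption(fig, "revenue quarter Q1 Q2 Q3", caps, 800.0))
```

In this sketch the spatial term favors captions directly above or below the figure, while the semantic term breaks ties when several candidates are nearby. The threshold lets a figure remain uncaptioned rather than being paired with an unrelated text block.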