🤖 AI Summary
This study addresses the challenges posed by heterogeneous content—comprising text, tables, and images—in financial PDF documents for retrieval-augmented generation (RAG) systems, an area lacking systematic evaluation of parsing and chunking strategies. The work presents the first end-to-end empirical investigation of PDF processing pipelines in the context of financial document question answering. It introduces a new benchmark, TableQuest, integrates existing datasets, and systematically evaluates diverse PDF parsers and text chunking methods—including overlap mechanisms—on downstream QA performance. The findings reveal a strong correlation between preservation of document structure and answer accuracy. To support reproducibility and future research, the authors release their benchmark publicly and provide actionable, empirically grounded guidelines for designing effective RAG pipelines tailored to complex financial documents.
📝 Abstract
PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, practitioners and researchers are increasingly developing new methods for automated PDF processing, including promising Retrieval-Augmented Generation (RAG) systems. However, no comprehensive study has investigated how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we present such a study (1) by focusing on question answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.
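To make "chunking with varied overlap" concrete, here is a minimal sketch of fixed-size chunking with a sliding window. The abstract does not describe the chunkers the study evaluated, so this is a generic illustration and not the authors' method. The function name `chunk_with_overlap` and its parameters are hypothetical.

```python
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into fixed-size chunks where consecutive chunks share `overlap` tokens.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the input,
        # so no trailing chunk consists purely of already-covered tokens.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    words = "Net revenue rose in the third quarter of the fiscal year".split()
    for chunk in chunk_with_overlap(words, chunk_size=5, overlap=2):
        print(" ".join(chunk))
```

A larger overlap lowers the chance that a fact is split across a chunk boundary, at the cost of indexing more redundant text. That is the kind of trade-off a study of overlap settings would measure.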