🤖 AI Summary
This work addresses the limitations of existing question-answering datasets, which are often confined to plain-text sources or narrow domains and thus ill-suited for evaluating complex, real-world QA tasks over authentic PDF documents. To bridge this gap, the authors present pdfQA, a systematically constructed multi-domain QA dataset built from real PDFs, comprising 2,000 human-annotated and 2,000 synthetically generated samples. The dataset is categorized along ten fine-grained complexity dimensions (including file type, answer type, and source position) and subjected to quality and difficulty filtering to ensure validity. Benchmark evaluations of open-source large language models reveal significant performance bottlenecks in PDF parsing, retrieval, and reasoning, establishing pdfQA as a challenging benchmark for end-to-end QA systems.
📝 Abstract
PDFs are the second-most used document type on the internet (after HTML). Yet existing QA datasets commonly start from plain-text sources or address only specific domains. In this paper, we present pdfQA, a multi-domain dataset of 2K human-annotated (real-pdfQA) and 2K synthetic (syn-pdfQA) QA pairs, differentiated along ten complexity dimensions (e.g., file type, source modality, source position, answer type). We apply and evaluate quality and difficulty filters on both datasets, obtaining valid and challenging QA pairs. We answer the questions with open-source LLMs, revealing challenges that correlate with our complexity dimensions. pdfQA provides a basis for end-to-end QA pipeline evaluation, testing diverse skill sets and local optimizations (e.g., in information retrieval or parsing).