🤖 AI Summary
This work exposes the cascading degradation caused by OCR errors in RAG systems: semantic and formatting noise introduced during external knowledge base construction severely impairs retrieval and generation performance. To address this, the authors introduce OHRBench—the first unified OCR-RAG evaluation benchmark—comprising 350 real-world PDFs and multimodal question-answering tasks. They formally define and quantify two types of OCR noise and propose a controllable noise injection methodology. Furthermore, they pioneer an OCR-free paradigm leveraging vision-language models (VLMs) for direct document understanding. Experiments demonstrate that mainstream OCR engines fail to support high-quality RAG knowledge bases; RAG performance degrades significantly with increasing OCR noise; and VLM-based approaches outperform OCR+LLM baselines across multiple tasks, validating both the efficacy and necessity of this paradigm shift.
📝 Abstract
Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect prediction of OCR and the inherent non-uniform representation of structured data, knowledge bases inevitably contain various OCR noises. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in documents, challenging existing OCR solutions used for RAG To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise: Semantic Noise and Formatting Noise and apply perturbation to generate a set of structured data with varying degrees of each OCR noise. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: https://github.com/opendatalab/OHR-Bench