When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation

📅 2026-04-29

🤖 AI Summary

This work addresses a gap in optical character recognition (OCR) evaluation: conventional character-level metrics do not capture how OCR output quality affects industrial retrieval-augmented generation (RAG) systems. The authors introduce a benchmark tailored to OCR-first RAG pipelines, covering 11 challenging document types. Their analysis shows that high OCR accuracy does not guarantee effective RAG performance: structural and semantic errors, whose impact varies across document categories, can cause substantial retrieval failures even when error rates stay low. By evaluating character error rate (CER) and word error rate (WER) alongside downstream task effectiveness, the study shows that state-of-the-art OCR models degrade clearly on realistic industrial documents despite strong scores on conventional benchmarks. These findings indicate that traditional OCR metrics alone are insufficient predictors of RAG utility. The benchmark is publicly released to support future research.

📝 Abstract

Industrial Retrieval-Augmented Generation (RAG) systems depend on optical character recognition (OCR) to transform visual documents into text. Existing OCR benchmarks rely on character-level metrics, which inadequately measure downstream RAG effectiveness under real-world conditions. We introduce an OCR benchmark for industrial RAG systems covering 11 challenging document types, including extreme layouts, high-resolution pages, complex or watermarked backgrounds, historical documents with non-standard reading orders, visually decorated text, and documents containing tables and mathematical formulas. Evaluating recent SOTA OCR models under a controlled OCR-first RAG pipeline shows clear performance degradation on realistic industrial documents despite strong conventional benchmark scores. We find that high OCR accuracy does not necessarily translate into strong downstream RAG performance: structural and semantic errors can cause substantial retrieval failures even when WER/CER remains low. Further analysis shows that this mismatch is category-dependent, arises through both retrieval-side and downstream generation-side failures, and remains stable across representative OCR-first pipeline choices. The benchmark is publicly available at https://github.com/Qihoo360/InduOCRBench.
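
For readers unfamiliar with the metrics named above, the following is a minimal sketch of how CER and WER are commonly computed: Levenshtein edit distance between reference and OCR output, normalized by reference length. It is not taken from the benchmark's code; the function names and the sample strings are assumptions made for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example (not from the benchmark): one misread digit.
reference = "Max pressure: 150 bar"
ocr_output = "Max pressure: 750 bar"

print(f"CER = {cer(reference, ocr_output):.3f}")  # 1 edit over 21 characters
print(f"WER = {wer(reference, ocr_output):.3f}")  # 1 edit over 4 words
```

In this made-up case a single substituted character keeps CER below 5%, yet the numeric value a user would query for is no longer in the text. That is one way a low character-level error rate can coexist with a failed retrieval.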

Problem

Research questions and friction points this paper is trying to address.

OCR robustness

Retrieval-Augmented Generation

document understanding

downstream performance

industrial benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

OCR robustness

Retrieval-Augmented Generation

industrial document benchmark