🤖 AI Summary
This work investigates the robustness of large language models (LLMs) for cross-lingual question answering (QA) over multilingual OCR text, systematically evaluating the impact of OCR-induced noise, including character insertion, deletion, and substitution, on QA performance. To this end, we introduce MultiOCR-QA, the first benchmark dataset specifically designed for assessing robustness in multilingual OCR-QA. It comprises 60,000 English, French, and German QA pairs derived from real-world historical document OCR outputs, validated by human annotators and augmented with controlled perturbations. The dataset supports fine-grained error-type annotation and cross-lingual evaluation. Experimental results demonstrate substantial performance degradation across mainstream LLMs under OCR noise, with accuracy drops exceeding 40% for some models, highlighting their vulnerability to digitization errors. This work fills a critical gap in evaluating LLM robustness to OCR noise and establishes a methodological foundation and benchmark for developing reliable multilingual OCR-QA systems.
📝 Abstract
Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors -- imperfect extraction of text, including character insertion, deletion, and permutation -- can significantly impact downstream tasks like question answering (QA). In this work, we introduce MultiOCR-QA, a multilingual QA dataset designed to analyze the effects of OCR noise on QA systems' performance. The MultiOCR-QA dataset comprises 60K question-answer pairs covering three languages: English, French, and German. The dataset is curated from OCR-ed historical documents, allowing for the evaluation of OCR-induced challenges on question answering. We evaluate MultiOCR-QA across various levels and types of OCR errors to assess the robustness of LLMs in handling real-world digitization errors. Our findings show that QA systems are highly prone to OCR-induced errors and exhibit performance degradation on noisy OCR text.
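The character-level error types described above (insertion, deletion, permutation/substitution) can be simulated with a simple noise injector. The sketch below is illustrative only, it is not the MultiOCR-QA construction pipeline, and the function name, noise rate, and replacement alphabet are assumptions:

```python
import random

def inject_ocr_noise(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Apply OCR-style character noise (insert/delete/substitute) to `text`.

    Each character is perturbed with probability `rate`. Hypothetical
    helper for illustration; real OCR errors follow engine- and
    script-specific confusion patterns, not uniform randomness.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible perturbations
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        if rng.random() < rate:
            op = rng.choice(["insert", "delete", "substitute"])
            if op == "insert":
                out.append(ch)
                out.append(rng.choice(alphabet))  # spurious extra character
            elif op == "substitute":
                out.append(rng.choice(alphabet))  # wrong character recognized
            # "delete": the character is simply dropped
        else:
            out.append(ch)
    return "".join(out)
```

Feeding such perturbed passages to a QA model at increasing `rate` values gives a controlled way to chart performance degradation against noise level.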