MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the robustness of large language models (LLMs) for question answering (QA) over multilingual OCR text, systematically evaluating how OCR-induced noise (character insertions, deletions, and substitutions) affects QA performance. To this end, the authors introduce MultiOCR-QA, the first benchmark dataset designed specifically to assess robustness in multilingual OCR-QA. It comprises 60,000 English, French, and German QA pairs derived from real-world OCR output of historical documents, validated by human annotators and augmented with controlled perturbations. The dataset supports fine-grained error-type annotation and cross-lingual evaluation. Experiments show substantial performance degradation across mainstream LLMs under OCR noise, with accuracy drops exceeding 40% for some models, highlighting their acute vulnerability to digitization errors. The work closes a gap in robustness evaluation for OCR-based QA and provides a methodological foundation and benchmark for building reliable multilingual OCR-QA systems.

📝 Abstract
Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors (imperfect extraction of the text, including character insertion, deletion, and permutation) can significantly impact downstream tasks like question answering (QA). In this work, we introduce MultiOCR-QA, a multilingual QA dataset designed to analyze the effects of OCR noise on QA systems' performance. MultiOCR-QA comprises 60K question-answer pairs covering three languages: English, French, and German. The dataset is curated from OCR'd historical documents, allowing for the evaluation of OCR-induced challenges on question answering. We evaluate MultiOCR-QA across varying levels and types of OCR errors to assess the robustness of LLMs in handling real-world digitization errors. Our findings show that QA systems are highly prone to OCR-induced errors and exhibit performance degradation on noisy OCR text.
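
To make the abstract's error taxonomy concrete, here is a minimal sketch of how character-level noise at a controlled rate could be injected into clean text for this kind of robustness testing. The function name, alphabet, and `noise_rate` parameter are illustrative assumptions, not the authors' actual perturbation procedure.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def inject_ocr_noise(text: str, noise_rate: float = 0.1, seed: int = 0) -> str:
    """Corrupt text with OCR-style character insertions, deletions, and permutations.

    `noise_rate` is the per-character corruption probability; the three error
    types follow the abstract, but this sampling scheme is an illustrative
    assumption, not the paper's perturbation procedure.
    """
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if rng.random() < noise_rate:
            op = rng.choice(["insert", "delete", "permute"])
            if op == "insert":
                out.append(c)
                out.append(rng.choice(ALPHABET))  # spurious extra character
            elif op == "permute" and i + 1 < len(chars):
                out.append(chars[i + 1])          # swap adjacent characters
                out.append(c)
                i += 1                            # skip the swapped neighbour
            elif op == "delete":
                pass                              # drop the character entirely
            else:
                out.append(c)                     # permute at end of string: no-op
        else:
            out.append(c)
        i += 1
    return "".join(out)

print(inject_ocr_noise("Optical Character Recognition", noise_rate=0.2))
```

Raising `noise_rate` gives progressively harder variants of the same passage, which is the knob one would sweep to study "levels" of OCR error as the abstract describes.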
Problem

Research questions and friction points this paper is trying to address.

Evaluating the robustness of LLMs to noisy input
Question answering over multilingual OCR text
Quantifying the impact of OCR errors on downstream QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MultiOCR-QA, a 60K-pair multilingual QA dataset (English, French, German)
Evaluates the impact of OCR error types and levels on QA accuracy (see the sketch after this list)
Analyzes LLM robustness to real-world digitization noise
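
The evaluation bullets above imply a clean-versus-noisy comparison. The sketch below shows one hedged way to measure it: score exact-match QA accuracy on clean and OCR-noised passages and report the relative drop. `answer_question` is a hypothetical stand-in for any LLM call, and exact match is an assumed metric; the paper's actual scoring may differ.

```python
from typing import Callable, List, Tuple

def relative_drop(
    qa_pairs: List[Tuple[str, str, str]],        # (clean_passage, question, gold_answer)
    answer_question: Callable[[str, str], str],  # hypothetical LLM wrapper
    noise_fn: Callable[[str], str],              # e.g. inject_ocr_noise from above
) -> float:
    """Relative exact-match accuracy drop (in %) when passages are OCR-noised."""
    def accuracy(triples):
        hits = sum(
            answer_question(passage, question).strip().lower() == gold.strip().lower()
            for (passage, question, gold) in triples
        )
        return hits / len(triples)

    clean_acc = accuracy(qa_pairs)
    noisy_acc = accuracy([(noise_fn(p), q, g) for (p, q, g) in qa_pairs])
    return 100.0 * (clean_acc - noisy_acc) / max(clean_acc, 1e-9)
```

Running this per language and per error type would yield the kind of fine-grained degradation breakdown the summary attributes to the paper.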