OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

📅 2025-05-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The absence of systematic benchmarks for text-rich image reasoning hinders rigorous evaluation of multimodal large language models (MLLMs). Method: This paper introduces OCR-Reasoning, the first dedicated benchmark of its kind, comprising 1,069 human-annotated samples that cover six core reasoning capabilities across 18 realistic scenario-based tasks, with both reasoning chains and final answers annotated. It proposes a dual-granularity evaluation paradigm that assesses both reasoning-process fidelity and answer correctness, enabled by structured prompting and fine-grained evaluation protocols that jointly measure logical coherence and accuracy. Contribution/Results: Experiments reveal that no state-of-the-art MLLM exceeds 50% overall accuracy on OCR-Reasoning, exposing fundamental limitations in joint text-image reasoning. The benchmark enables precise, process-aware diagnosis of MLLM reasoning deficiencies and establishes a new standard for evaluating multimodal reasoning robustness.

📝 Abstract
Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50% across OCR-Reasoning, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.
Problem

Research questions and friction points this paper is trying to address.

Assessing MLLMs' text-rich image reasoning, which lacks a systematic benchmark
Evaluating both the reasoning processes and the final answers of models in text-rich visual scenarios
Revealing MLLMs' limitations: none surpasses 50% accuracy on text-rich image reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

OCR-Reasoning: a benchmark of 1,069 human-annotated samples for text-rich image reasoning
Annotates reasoning processes alongside final answers
Evaluates MLLMs holistically, on both problem-solving process and outcome
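The dual-granularity idea of scoring the reasoning chain and the final answer separately can be sketched as below. This is a minimal illustrative sketch only: all function names and the substring-based step-matching heuristic are assumptions, standing in for the benchmark's actual fine-grained protocol (the real evaluation scripts are in the linked GitHub repository).

```python
# Hypothetical sketch of dual-granularity evaluation: score the final
# answer and the reasoning chain separately. Names and heuristics here
# are illustrative assumptions, not OCR-Reasoning's actual protocol.

def normalize(text: str) -> str:
    """Lowercase and drop punctuation for lenient string matching."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

def answer_score(pred_answer: str, gold_answer: str) -> float:
    """Binary correctness of the final answer."""
    return 1.0 if normalize(pred_answer) == normalize(gold_answer) else 0.0

def reasoning_score(pred_steps: list[str], gold_steps: list[str]) -> float:
    """Fraction of annotated gold steps covered by the model's chain.

    Substring containment is a crude stand-in for the benchmark's
    fine-grained (e.g. judge-based) step matching.
    """
    if not gold_steps:
        return 0.0
    pred_blob = normalize(" ".join(pred_steps))
    hits = sum(1 for step in gold_steps if normalize(step) in pred_blob)
    return hits / len(gold_steps)

def dual_granularity_score(pred_answer, pred_steps, gold_answer, gold_steps):
    """Report process fidelity and answer correctness as separate scores."""
    return {
        "answer": answer_score(pred_answer, gold_answer),
        "reasoning": reasoning_score(pred_steps, gold_steps),
    }
```

Reporting the two scores separately (rather than a single aggregate) is what enables the process-aware diagnosis described above: a model can reach the right answer with a flawed chain, or reason soundly but slip on the final step, and the two failure modes are distinguishable.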