Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?

📅 2025-05-19
🤖 AI Summary
This work addresses a gap in the ability of large multimodal models (LMMs) to perform complex logical reasoning from OCR-derived cues. Existing OCR benchmarks predominantly emphasize simple visual question answering and neglect higher-order reasoning. To overcome this limitation, we introduce Reasoning-OCR, the first dedicated benchmark for OCR-augmented logical reasoning. It comprises 150 carefully curated questions across six visually diverse scenarios and targets six reasoning types (causal, temporal, constraint satisfaction, compositional, counterfactual, and multi-step deduction) while minimizing domain-specific knowledge requirements. We propose an OCR cue-driven evaluation paradigm that integrates textual content and spatial layout information, leveraging multimodal prompting, structured reasoning chains, and cross-scenario generalization assessment. A systematic evaluation of over a dozen state-of-the-art open- and closed-source LMMs reveals a pronounced capability gap in high-level logical reasoning. This work establishes the first standardized, reproducible benchmark and evaluation framework for advancing OCR-enhanced reasoning in LMMs.

📝 Abstract
Large Multimodal Models (LMMs) have become increasingly versatile and now show impressive capabilities related to Optical Character Recognition (OCR). Existing OCR-related benchmarks emphasize evaluating LMMs on relatively simple tasks such as visual question answering and visual-text parsing. However, the extent to which LMMs can handle complex logical reasoning problems based on OCR cues remains relatively unexplored. To this end, we introduce the Reasoning-OCR benchmark, which challenges LMMs to solve complex reasoning problems based on cues that can be extracted from rich visual text. Reasoning-OCR covers six visual scenarios and encompasses 150 meticulously designed questions categorized into six reasoning challenges. Additionally, Reasoning-OCR minimizes the impact of domain-specific knowledge. Our evaluation offers insights into how proprietary and open-source LMMs perform on the different reasoning challenges, underscoring the urgent need to improve their reasoning performance. We hope Reasoning-OCR can inspire and facilitate future research on enhancing complex reasoning ability based on OCR cues. Reasoning-OCR is publicly available at https://github.com/Hxyz-123/ReasoningOCR.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LMMs' ability to solve complex logical reasoning problems using OCR cues
Introducing the Reasoning-OCR benchmark for diverse visual-text reasoning challenges
Assessing LMMs' performance on OCR-based complex reasoning tasks and highlighting the need to improve it
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates proprietary and open-source Large Multimodal Models on OCR-based reasoning
Introduces the Reasoning-OCR benchmark for this evaluation
Covers six visual scenarios and six reasoning challenges with 150 questions
Haibin He
School of Computer Science, National Engineering Research Center for Multimedia Software, and Institute of Artificial Intelligence, Wuhan University, China
Maoyuan Ye
Wuhan University
CV, OCR, LLM, MLLM
Jing Zhang
School of Computer Science, National Engineering Research Center for Multimedia Software, and Institute of Artificial Intelligence, Wuhan University, China
Xiantao Cai
School of Computer Science, National Engineering Research Center for Multimedia Software, and Institute of Artificial Intelligence, Wuhan University, China
Juhua Liu
School of Computer Science, National Engineering Research Center for Multimedia Software, and Institute of Artificial Intelligence, Wuhan University, China
Bo Du
Department of Management, Griffith Business School
Sustainable Transport, Travel Behaviour, Urban Data Analytics, Logistics and Supply Chain
Dacheng Tao
Nanyang Technological University
artificial intelligence, machine learning, computer vision, image processing, data mining