Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?

📅 2025-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work examines how well large multimodal models (LMMs) perform complex logical reasoning from OCR-derived cues. Existing OCR benchmarks mostly test relatively simple visual question answering and visual-text parsing, so we introduce Reasoning-OCR, a benchmark dedicated to reasoning over cues extracted from rich visual text. It comprises 150 carefully designed questions across six visual scenarios, grouped into six reasoning challenges, and it minimizes the need for field-specialized knowledge. An evaluation of proprietary and open-source LMMs across these challenges shows that their reasoning performance still needs substantial improvement. The benchmark is publicly available to support future research on OCR-based reasoning.

📝 Abstract

Large Multimodal Models (LMMs) have become increasingly versatile, accompanied by impressive Optical Character Recognition (OCR) related capabilities. Existing OCR-related benchmarks emphasize evaluating LMMs' abilities on relatively simple tasks such as visual question answering and visual-text parsing. However, the extent to which LMMs can deal with complex logical reasoning problems based on OCR cues is relatively unexplored. To this end, we introduce the Reasoning-OCR benchmark, which challenges LMMs to solve complex reasoning problems based on the cues that can be extracted from rich visual-text. Reasoning-OCR covers six visual scenarios and encompasses 150 meticulously designed questions categorized into six reasoning challenges. Additionally, Reasoning-OCR minimizes the impact of field-specialized knowledge. Our evaluation offers some insights for proprietary and open-source LMMs in different reasoning challenges, underscoring the urgent need to improve their reasoning performance. We hope Reasoning-OCR can inspire and facilitate future research on enhancing complex reasoning ability based on OCR cues. Reasoning-OCR is publicly available at https://github.com/Hxyz-123/ReasoningOCR.

Problem

Research questions and friction points this paper is trying to address.

Evaluating LMMs' ability to solve complex logical reasoning problems using OCR cues

Introducing the Reasoning-OCR benchmark for diverse visual-text reasoning challenges

Assessing and improving LMMs' performance in OCR-based complex reasoning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates proprietary and open-source Large Multimodal Models on reasoning from OCR cues

Introduces Reasoning-OCR benchmark for evaluation

Covers six visual scenarios and 150 questions categorized into six reasoning challenges
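Because the benchmark groups its questions by reasoning challenge, results are most informative when reported per challenge as well as overall. The following is a minimal sketch of such a breakdown, not the authors' evaluation code. The record keys (`challenge`, `prediction`, `answer`), the placeholder challenge names, and the exact-match scoring rule are all assumptions made for illustration; the paper may score answers differently.

```python
from collections import defaultdict


def accuracy_by_challenge(records):
    """Return {challenge: accuracy} plus an 'overall' entry.

    Each record is a dict with hypothetical keys:
    'challenge', 'prediction', and 'answer' (all strings).
    Scoring is case-insensitive exact match, which is an assumption.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        challenge = record["challenge"]
        total[challenge] += 1
        predicted = record["prediction"].strip().lower()
        expected = record["answer"].strip().lower()
        if predicted == expected:
            correct[challenge] += 1

    # Per-challenge accuracy, then a question-weighted overall score.
    scores = {c: correct[c] / total[c] for c in total}
    n_questions = sum(total.values())
    scores["overall"] = sum(correct.values()) / n_questions if n_questions else 0.0
    return scores


# Toy records with placeholder challenge names (not benchmark data).
toy_records = [
    {"challenge": "challenge_a", "prediction": "42", "answer": "42"},
    {"challenge": "challenge_a", "prediction": "7", "answer": "8"},
    {"challenge": "challenge_b", "prediction": "Yes", "answer": "yes"},
]
toy_scores = accuracy_by_challenge(toy_records)
```

Weighting the overall score by question count rather than averaging the per-challenge scores keeps it comparable to a plain accuracy over all 150 questions, whatever the split across challenges.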
