🤖 AI Summary
This work addresses a critical gap in the ability of large multimodal models (LMMs) to perform complex logical reasoning using OCR-derived cues. Existing OCR benchmarks predominantly emphasize simple visual question answering and neglect higher-order reasoning. To overcome this limitation, we introduce Reasoning-OCR, the first dedicated benchmark for OCR-augmented logical reasoning. It comprises 150 carefully curated questions across six visually diverse scenarios and targets six reasoning types (causal, temporal, constraint satisfaction, compositional, counterfactual, and multi-step deduction) while minimizing domain-specific knowledge requirements. We propose an OCR cue-driven evaluation paradigm that integrates textual content and spatial layout information, leveraging multimodal prompting, structured reasoning chains, and cross-scenario generalization assessment. A systematic evaluation of over a dozen state-of-the-art open- and closed-source LMMs reveals a pronounced capability gap in high-level logical reasoning. This work establishes the first standardized, reproducible benchmark and evaluation framework for advancing OCR-enhanced reasoning in LMMs.
📝 Abstract
Large Multimodal Models (LMMs) have become increasingly versatile, with impressive Optical Character Recognition (OCR) related capabilities. Existing OCR-related benchmarks emphasize evaluating LMMs on relatively simple tasks such as visual question answering and visual-text parsing. However, the extent to which LMMs can handle complex logical reasoning problems based on OCR cues remains relatively unexplored. To this end, we introduce the Reasoning-OCR benchmark, which challenges LMMs to solve complex reasoning problems based on cues that can be extracted from rich visual text. Reasoning-OCR covers six visual scenarios and encompasses 150 meticulously designed questions categorized into six reasoning challenges. Additionally, Reasoning-OCR minimizes the impact of field-specialized knowledge. Our evaluation offers insights into how proprietary and open-source LMMs perform on the different reasoning challenges, underscoring the urgent need to improve their reasoning performance. We hope Reasoning-OCR can inspire and facilitate future research on enhancing complex reasoning ability based on OCR cues. Reasoning-OCR is publicly available at https://github.com/Hxyz-123/ReasoningOCR.