🤖 AI Summary
This work addresses the lack of dedicated benchmarks for evaluating logical reasoning over text-rich images in large multimodal models (LMMs). We introduce LogicOCR, the first benchmark explicitly designed for logical reasoning on such images: it comprises 1,100 multiple-choice questions derived from the Chinese National Civil Servant Examination, converted into images by an automated synthesis pipeline that steers GPT-Image-1 with prompt templates, and manually verified to ensure fidelity and diversity. The benchmark minimizes reliance on domain-specific knowledge while stressing logical reasoning, and the synthesis pipeline scalably varies layout, typography, and visual realism. We systematically evaluate representative open-source and proprietary LMMs using chain-of-thought prompting, test-time scaling, and multi-dimensional attribution analysis. Results reveal a substantial performance gap between reasoning over text-rich images and over plain text, and identify critical bottlenecks, including sensitivity to input modality and to image-text orientation, that limit current LMM capabilities.
📝 Abstract
Recent advances in Large Multimodal Models (LMMs) have significantly improved their reasoning and Optical Character Recognition (OCR) capabilities. However, their performance on complex logical reasoning tasks involving text-rich images remains underexplored. To bridge this gap, we introduce LogicOCR, a benchmark comprising 1,100 multiple-choice questions designed to evaluate LMMs' logical reasoning abilities on text-rich images, while minimizing reliance on domain-specific knowledge (e.g., mathematics). We construct LogicOCR by curating a text corpus from the Chinese National Civil Servant Examination and developing a scalable, automated pipeline to convert it into multimodal samples. First, we design prompt templates to steer GPT-Image-1 to generate images with diverse backgrounds, interleaved text-illustration layouts, and varied fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified, and low-quality examples are discarded. We evaluate a range of representative open-source and proprietary LMMs under both Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag behind their text-only performance when reasoning over images, indicating that they have not yet fully bridged visual reading and reasoning. We hope LogicOCR will serve as a valuable resource for advancing multimodal reasoning research. The dataset is available at https://github.com/MiliLab/LogicOCR.
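To make the synthesis step concrete, the sketch below shows how a prompt template might steer GPT-Image-1 through the OpenAI Images API to render a question as a text-rich image. This is a minimal sketch, not the paper's actual pipeline: the template wording, the `render_prompt` helper, and the style lists (`BACKGROUNDS`, `LAYOUTS`, `FONTS`) are illustrative assumptions loosely mirroring the axes the abstract names (backgrounds, text-illustration layouts, fonts).

```python
# Minimal sketch of the image-synthesis step (illustrative; not the paper's code).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import base64
import random

from openai import OpenAI

client = OpenAI()

# Hypothetical style axes mirroring the abstract's description: diverse
# backgrounds, interleaved text-illustration layouts, and varied fonts.
BACKGROUNDS = ["plain paper", "whiteboard", "notebook page", "poster"]
LAYOUTS = ["text only", "text interleaved with a small illustration"]
FONTS = ["serif", "sans-serif", "handwritten"]

def render_prompt(question_text: str) -> str:
    """Fill a hypothetical prompt template that asks GPT-Image-1 to
    typeset the question as a realistic text-rich image."""
    return (
        f"Render the following multiple-choice question as a realistic "
        f"text-rich image on a {random.choice(BACKGROUNDS)} background, "
        f"using a {random.choice(FONTS)} font, {random.choice(LAYOUTS)}. "
        f"All text must be fully legible and faithful to the source:\n\n"
        f"{question_text}"
    )

def synthesize_image(question_text: str, out_path: str) -> None:
    result = client.images.generate(
        model="gpt-image-1",
        prompt=render_prompt(question_text),
        size="1024x1024",
    )
    # gpt-image-1 returns base64-encoded image data.
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))
```

In the paper's pipeline, the generated images are then manually checked and low-quality samples discarded; any such filtering logic is omitted here.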
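The CoT versus direct-answer comparison amounts to two prompt variants per question plus an answer-extraction rule; a schematic version follows. The exact instructions and extraction heuristic are assumptions, since the abstract does not specify them.

```python
# Schematic CoT vs. direct-answer prompting for one multiple-choice sample
# (illustrative; the benchmark's actual instructions may differ).
import re

COT_SUFFIX = "Let's think step by step, then state the final choice as 'Answer: X'."
DIRECT_SUFFIX = "Answer with only the option letter (A, B, C, or D)."

def build_prompt(question: str, options: dict[str, str], cot: bool) -> str:
    """Assemble the question, its options, and the setting-specific instruction."""
    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    suffix = COT_SUFFIX if cot else DIRECT_SUFFIX
    return f"{question}\n{opts}\n{suffix}"

def extract_choice(response: str) -> str | None:
    """Naive extraction: take the last standalone option letter in the output."""
    matches = re.findall(r"\b([ABCD])\b", response)
    return matches[-1] if matches else None
```

Feeding the same question as rendered image versus raw text under these two settings is what exposes the multimodal-versus-text gap the abstract reports.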