🤖 AI Summary
Current large multimodal models (LMMs) exhibit significant weaknesses in fine-grained text-centric tasks, including precise text localization, handwritten text recognition, document layout understanding, and logical reasoning.
Method: We introduce OCRBench v2, a large-scale bilingual text-centric benchmark comprising 10,000 human-verified multimodal question-answer pairs across 31 complex real-world scenarios. OCRBench v2 integrates spatial coordinate annotations, structured OCR labels, and logically grounded question stems, and its thorough evaluation metrics cover text localization, handwritten content extraction, layout perception, complex element parsing, and logical reasoning.
Contribution/Results: Evaluating 22 state-of-the-art LMMs, we find that 20 score below 50 out of 100, exposing five systemic limitations: less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. OCRBench v2 covers four times as many tasks as the previous multi-scene benchmark OCRBench, and its reproducible metrics enable granular, decomposable model assessment.
📝 Abstract
Evaluating the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has attracted growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (four times more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios, including street scenes, receipts, formulas, and diagrams), and thorough evaluation metrics, with a total of 10,000 human-verified question-answer pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (out of 100) and suffer from five types of limitations: less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-liu/MultimodalOCR.