🤖 AI Summary
Current large multimodal models (LMMs) exhibit significant weaknesses in fine-grained text-centric tasks, including precise text localization, handwritten text recognition, document layout understanding, and logical reasoning.
Method: We introduce OCRBench v2, a large-scale bilingual text-centric benchmark comprising 10,000 human-verified multimodal question-answer pairs across 31 complex real-world scenarios. OCRBench v2 integrates spatial coordinate annotations, structured OCR labels, and logically grounded question stems, and its thorough evaluation metrics cover text localization, handwritten content extraction, layout perception, complex element parsing, and logical reasoning.
Contribution/Results: Evaluating 22 state-of-the-art LMMs, we find that 20 score below 50 out of 100, exposing five systemic limitations: less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. OCRBench v2 covers four times as many tasks as the previous multi-scene benchmark OCRBench, and its reproducible metrics enable granular, decomposable model assessment.
📝 Abstract
Evaluating the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has attracted growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (four times more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios, including street scenes, receipts, formulas, and diagrams), and thorough evaluation metrics, with a total of 10,000 human-verified question-answer pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (out of 100) and suffer from five types of limitations: less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-liu/MultimodalOCR.