OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

📅 2024-12-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large multimodal models exhibit significant weaknesses in fine-grained vision-language tasks, including precise text localization, handwritten text recognition, document layout understanding, and logical reasoning. Method: The paper introduces OCRBench v2, a large-scale bilingual benchmark for visual text localization and reasoning, comprising 10,000 human-verified multimodal question-answer pairs across 31 complex real-world scenarios. The benchmark integrates spatial coordinate annotations, structured OCR labels, and logically grounded question stems, and its fine-grained evaluation framework yields an attributable diagnosis of five core capability deficits: infrequent text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. Contribution/Results: Evaluating 22 state-of-the-art models, the authors find that 20 score below 50/100, exposing systemic bottlenecks. OCRBench v2 covers four times as many tasks as the previous multi-scene benchmark OCRBench, and its modular, reproducible metrics enable granular, decomposable model assessment.

📝 Abstract
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios including street scene, receipt, formula, diagram, and so on), and thorough evaluation metrics, with a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-liu/MultimodalOCR.
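The abstract reports per-model scores on a 0–100 scale aggregated across many tasks. As a purely hypothetical illustration of such decomposable scoring (this is not the repository's actual evaluation code; the task names, scores, and unweighted-mean aggregation are all invented for the sketch), an overall score could be computed from per-task scores like so:

```python
# Hypothetical sketch: aggregate per-task scores (each 0-100) into one
# overall benchmark score via an unweighted mean. Task names are
# illustrative only and do not reflect OCRBench v2's real task list.
def overall_score(task_scores: dict) -> float:
    """Return the unweighted mean of per-task scores (0-100 scale)."""
    if not task_scores:
        raise ValueError("no task scores given")
    return sum(task_scores.values()) / len(task_scores)

scores = {
    "text_recognition": 72.5,
    "text_localization": 41.0,
    "handwritten_extraction": 38.5,
    "logical_reasoning": 30.0,
}
print(round(overall_score(scores), 2))
```

Keeping per-task scores separate until the final averaging step is what makes the assessment decomposable: a low aggregate can be traced back to the specific capability (e.g., logical reasoning) that drags it down.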
Problem

Research questions and friction points this paper is trying to address.

- Image Text Localization
- Handwriting Recognition
- Complex Content Analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

- OCRBench v2
- Diverse Evaluation Tasks
- Comprehensive Text Understanding
Authors

Ling Fu
Master's student in Computer Science, Huazhong University of Science and Technology
Computer Vision

Biao Yang
Shanghai Jiao Tong University, Antai College of Economics and Management
Asset Pricing, Climate Finance

Zhebin Kuang
Huazhong University of Science and Technology

Jiajun Song
Michigan Technological University
Wave Energy Converter

Yuzhe Li
Huazhong University of Science and Technology

Linghao Zhu
Huazhong University of Science and Technology

Qidi Luo
Huazhong University of Science and Technology

Xinyu Wang
University of Adelaide

Hao Lu
Huazhong University of Science and Technology

Mingxin Huang
South China University of Technology
MLLM, Computer Vision, Text Spotting

Zhang Li
Huazhong University of Science and Technology

Guozhi Tang
ByteDance Inc.

Bin Shan
ByteDance

Chunhui Lin
ByteDance

Qi Liu
ByteDance

Binghong Wu
ByteDance

Hao Feng
ByteDance

Hao Liu
ByteDance

Can Huang
ByteDance

Jingqun Tang
ByteDance Inc.
Computer Vision, Document Intelligence, MLLM, Multimodal Generative Models

Wei Chen
Huazhong University of Science and Technology

Lianwen Jin
Professor of Electronic and Information Engineering, South China University of Technology
Optical Character Recognition (OCR), Computer Vision, Document AI, Multimodal LLMs

Yuliang Liu
Huazhong University of Science and Technology

Xiang Bai
Huazhong University of Science and Technology (HUST)
Computer Vision, OCR