CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the gap between real-world enterprise document processing and existing OCR benchmarks, which often fail to reflect the practical challenges faced in deployment. To bridge this gap, the authors introduce CC-OCR V2, a comprehensive OCR evaluation benchmark grounded in real-world applications and organized into five core tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering. The benchmark comprises 7,093 high-difficulty samples, with an emphasis on the hard and corner cases that are common in practice yet underrepresented in prior benchmarks. A systematic evaluation of 14 advanced large multimodal models (LMMs) shows that even state-of-the-art models degrade substantially across tasks and scenarios, indicating a significant gap between performance on current benchmarks and the demands of real-world deployment.

📝 Abstract

Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications remains underexplored, as existing benchmarks adopt task scopes misaligned with practical applications and assume homogeneous acquisition conditions. To address this gap, we introduce CC-OCR V2, a comprehensive and challenging OCR benchmark tailored to real-world document processing. CC-OCR V2 focuses on practical enterprise document processing tasks and incorporates hard and corner cases that are critical yet underrepresented in prior benchmarks, covering 5 major OCR-centric tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering, comprising 7,093 high-difficulty samples. Extensive experiments on 14 advanced LMMs reveal that current models fall short of real-world application requirements. Even state-of-the-art LMMs exhibit substantial performance degradation across diverse tasks and scenarios. These findings reveal a significant gap between performance on current benchmarks and effectiveness in real-world applications. We release the full dataset and evaluation toolkit at https://github.com/eioss/CC-OCR-V2.

Problem

Research questions and friction points this paper is trying to address.

Optical Character Recognition

Large Multimodal Models

Real-world Document Processing

Benchmarking

Document Literacy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Multimodal Models

OCR Benchmark

Real-world Document Processing