🤖 AI Summary
This work addresses OCR in dynamic video scenes by introducing the first VLM-oriented video OCR benchmark—comprising 1,477 frames from real-world settings including code editors, news broadcasts, and YouTube videos—and establishing an evaluation paradigm for VLM-based OCR in dynamic visual environments. Methodologically, the authors systematically benchmark state-of-the-art multimodal large models (e.g., Claude-3, Gemini-1.5, GPT-4o) against traditional OCR engines (e.g., EasyOCR, RapidOCR), employing Word Error Rate (WER), Character Error Rate (CER), and Accuracy as complementary metrics. Results show that VLMs significantly outperform traditional methods across most video OCR tasks (average WER reduction of 22.6%), yet they remain brittle under occlusion and stylized fonts and are further constrained by platform-level content security policies. To foster reproducible research, the authors open-source a high-quality, multi-source annotated dataset, a unified evaluation framework, and all implementation code—providing a standardized, extensible benchmark for video OCR.
📝 Abstract
This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state-of-the-art VLMs (Claude-3, Gemini-1.5, and GPT-4o) are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks, demonstrating their potential to outperform conventional OCR models in many scenarios. However, challenges such as hallucinations, content security policies, and sensitivity to occluded or stylized text remain. The dataset and benchmarking framework are publicly available to foster further research.
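The WER and CER metrics named in the abstract are both Levenshtein (edit) distances normalized by reference length, computed over words and characters respectively. A minimal sketch in Python is shown below; note that the paper's exact text normalization (case folding, punctuation stripping, whitespace handling) is not specified here, so these helper names and conventions are illustrative assumptions, not the benchmark's actual implementation:

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    # Word Error Rate: word-level edit distance / number of reference words.
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference, hypothesis):
    # Character Error Rate: character-level edit distance / reference length.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)
```

For example, if the reference frame transcript is "import numpy as np" and an OCR system reads "import numpy as rp", the CER is 1/18 while the WER is 1/4, which is why the two metrics are reported as complementary.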