🤖 AI Summary
This work addresses OCR in dynamic video scenes by introducing the first VLM-oriented video OCR benchmark—comprising 1,477 frames from real-world settings including code editors, news broadcasts, and YouTube videos—and establishing an evaluation paradigm for VLM-based OCR in dynamic visual environments. Methodologically, the authors systematically benchmark state-of-the-art multimodal large models (e.g., Claude-3, Gemini-1.5, GPT-4o) against traditional OCR engines (e.g., EasyOCR, RapidOCR), employing Word Error Rate (WER), Character Error Rate (CER), and Accuracy as complementary metrics. Results show that VLMs significantly outperform traditional methods across most video OCR tasks (average WER reduction of 22.6%), yet they remain brittle under occlusion and stylized fonts and are further constrained by platform-level content security policies. To foster reproducible research, the authors open-source a high-quality, multi-source annotated dataset, a unified evaluation framework, and all implementation code—providing a standardized, extensible benchmark for video OCR.
📝 Abstract
This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state-of-the-art VLMs (Claude-3, Gemini-1.5, and GPT-4o) are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks, demonstrating their potential to outperform conventional OCR models in many scenarios. However, challenges such as hallucinations, content security policies, and sensitivity to occluded or stylized text remain. The dataset and benchmarking framework are publicly available to foster further research.
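The WER and CER metrics named in the abstract are both Levenshtein (edit) distances normalized by reference length, computed over words and characters respectively. A minimal sketch in Python is shown below; note that the paper's exact text normalization (case folding, punctuation stripping, whitespace handling) is not specified here, so these helper names and conventions are illustrative assumptions, not the benchmark's actual implementation:

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    # Word Error Rate: word-level edit distance / number of reference words.
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference, hypothesis):
    # Character Error Rate: character-level edit distance / reference length.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)
```

For example, if the reference frame transcript is "import numpy as np" and an OCR system reads "import numpy as rp", the CER is 1/18 while the WER is 1/4, which is why the two metrics are reported as complementary.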