🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit limited performance on OCR tasks, struggling to balance accuracy and generalization. To address this, the authors propose Ocean-OCR, a dedicated 3B-parameter multimodal large model for universal OCR. Methodologically, it introduces: (1) a Native Resolution Vision Transformer (ViT) that processes images of arbitrary resolution end-to-end, eliminating the text distortion caused by conventional resizing; (2) a multi-stage OCR training strategy built on a large collection of high-quality synthetic and real-world data spanning documents, scene text, and handwriting. Empirically, Ocean-OCR achieves state-of-the-art results across major OCR benchmarks, outperforming professional OCR models such as TextIn and PaddleOCR, while retaining strong general-purpose vision-language understanding. This work bridges the gap between task-specific optimization and multimodal generality, unifying OCR specialization with broad multimodal competence.
📝 Abstract
Multimodal large language models (MLLMs) have shown impressive capabilities across various domains, excelling in processing and understanding information from multiple modalities. Despite rapid recent progress, insufficient OCR ability hinders MLLMs from excelling in text-related tasks. In this paper, we present **Ocean-OCR**, a 3B MLLM with state-of-the-art performance on various OCR scenarios and comparable understanding ability on general tasks. We employ a Native Resolution ViT to enable variable-resolution input and utilize a substantial collection of high-quality OCR datasets to enhance model performance. We demonstrate the superiority of Ocean-OCR through comprehensive experiments on open-source OCR benchmarks and across various OCR scenarios. These scenarios encompass document understanding, scene text recognition, and handwritten recognition, highlighting the robust OCR capabilities of Ocean-OCR. Note that Ocean-OCR is the first MLLM to outperform professional OCR models such as TextIn and PaddleOCR.
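The key idea behind native-resolution input is to turn each image into a variable-length patch sequence at its original size, rather than resizing everything to a fixed square (which distorts text). The paper does not give implementation details, so the sketch below is only an illustration of the general patchification idea, using NumPy and a hypothetical patch size of 14; the real Ocean-OCR ViT may differ in padding, patch size, and positional handling.

```python
import numpy as np

def patchify_native(image: np.ndarray, patch: int = 14) -> np.ndarray:
    """Split an arbitrary-resolution (H, W, C) image into a variable-length
    sequence of flattened patches. H and W are padded up to multiples of
    `patch` instead of the image being resized, so text is not distorted."""
    h, w, c = image.shape
    pad_h = (patch - h % patch) % patch
    pad_w = (patch - w % patch) % patch
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))
    H, W, _ = padded.shape
    # Rearrange into (num_patches, patch*patch*C); sequence length
    # grows with the input resolution instead of being fixed.
    patches = (padded
               .reshape(H // patch, patch, W // patch, patch, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch * patch * c))
    return patches

# Two images of different resolutions yield different sequence lengths,
# with no resizing applied to either.
a = patchify_native(np.zeros((224, 224, 3)))  # 16*16 = 256 patches
b = patchify_native(np.zeros((300, 500, 3)))  # padded to 308x504 -> 22*36 patches
```

In a full model, each flattened patch would then be linearly projected to the transformer's hidden size, with positions encoded to reflect the original 2D grid.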