Ocean-OCR: Towards General OCR Application via a Vision-Language Model

📅 2025-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) perform poorly on OCR tasks and struggle to balance accuracy with generalization. To address this, we propose Ocean-OCR, a dedicated 3B-parameter MLLM for general-purpose OCR. The method combines (1) a Native Resolution Vision Transformer (ViT) that processes images of arbitrary resolution end to end, avoiding the text distortion caused by conventional resizing; and (2) a multi-stage OCR instruction-tuning strategy that uses high-quality synthetic and real-world data spanning documents, scene text, and handwriting. Empirically, Ocean-OCR achieves state-of-the-art results on major OCR benchmarks, outperforming specialized OCR systems such as TextIn and PaddleOCR, while retaining comparable understanding ability on general vision-language tasks. The work thus narrows the gap between OCR-specific optimization and general multimodal competence within a single model.

📝 Abstract
Multimodal large language models (MLLMs) have shown impressive capabilities across various domains, excelling in processing and understanding information from multiple modalities. Despite the rapid progress made previously, insufficient OCR ability hinders MLLMs from excelling in text-related tasks. In this paper, we present Ocean-OCR, a 3B MLLM with state-of-the-art performance on various OCR scenarios and comparable understanding ability on general tasks. We employ Native Resolution ViT to enable variable resolution input and utilize a substantial collection of high-quality OCR datasets to enhance the model performance. We demonstrate the superiority of Ocean-OCR through comprehensive experiments on open-source OCR benchmarks and across various OCR scenarios. These scenarios encompass document understanding, scene text recognition, and handwritten recognition, highlighting the robust OCR capabilities of Ocean-OCR. Note that Ocean-OCR is the first MLLM to outperform professional OCR models such as TextIn and PaddleOCR.
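To illustrate why variable-resolution input matters for text images, here is a minimal sketch rather than the paper's implementation. It counts the patch tokens produced when an image is split at its own resolution, and measures how much a forced square resize would change the aspect ratio. The function names and the patch size of 14 are assumptions chosen for illustration and do not come from the paper.

```python
import math


def native_patch_grid(height, width, patch=14):
    """Return (rows, cols, tokens) for an image patchified at its own resolution.

    Hypothetical helper: `patch=14` is an assumed patch size, not a value
    reported for Ocean-OCR. Edge patches are assumed to be padded, so the
    grid size is rounded up.
    """
    if height <= 0 or width <= 0 or patch <= 0:
        raise ValueError("height, width and patch must be positive")
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return rows, cols, rows * cols


def square_resize_distortion(height, width):
    """Factor by which the aspect ratio changes if the image is forced square.

    The horizontal scale is target/width and the vertical scale is
    target/height, so their ratio does not depend on the target size.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    return max(height, width) / min(height, width)


# A wide text line, 280 pixels high and 1120 pixels wide.
rows, cols, tokens = native_patch_grid(280, 1120)   # (20, 80, 1600)
distortion = square_resize_distortion(280, 1120)    # 4.0
```

In this example the native grid keeps the 1:4 shape of the text line (20 by 80 patches), whereas a square resize would squeeze the characters horizontally by a factor of 4 relative to their height.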
Problem

Research questions and friction points this paper is trying to address.

Multimodal Language Models
Text Recognition
Performance Improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ocean-OCR
Multimodal Large Language Model
Superior Optical Character Recognition
Authors
Song Chen
Baichuan Inc.
Xinyu Guo
Samsung Research America
AI, computer vision, machine learning, medical image analysis
Yadong Li
Baichuan Inc.
Tao Zhang
Baichuan Inc.
Mingan Lin
Baichuan Inc.
LLM, MLLM, AI
Dongdong Kuang
Baichuan Inc., Beihang University
Youwei Zhang
Baichuan Inc., Beijing University of Posts and Telecommunications
Lingfeng Ming
Alibaba Group
Large Language Model, Natural Language Processing
Fengyu Zhang
Baichuan Inc.
Yuran Wang
Baichuan Inc., Wuhan University
Jianhua Xu
University of Electronic Science and Technology of China
Multi-Agent, Evolutionary Games, LLM-Agents
Zenan Zhou
Baichuan Inc.
Weipeng Chen
Baichuan Inc.