🤖 AI Summary
Large vision-language models (LVLMs) suffer from hallucination and weaker accuracy on OCR tasks compared to specialized OCR expert models. To address this, the authors propose a reasoning-and-tool-augmented OCR framework that explicitly models a chain of reasoning and dynamically invokes external OCR experts as trusted tools, verifying and refining LVLM outputs through multi-step interaction to suppress hallucination and improve accuracy. Unlike end-to-end fine-tuning, the framework decouples perception from reasoning: because expert models are small and cheap to retrain, the LVLM's performance can be improved without costly LVLM iteration. Experiments on ReST and OmniDocBench show consistent gains over non-reasoning LVLM baselines and standalone expert models, with good robustness and cross-domain generalization. This work offers a paradigm for enhancing LVLMs with high-precision OCR capabilities via tool-integrated, iterative reasoning.
📝 Abstract
Recent advances in large vision-language models (LVLMs) have enabled a new paradigm of end-to-end document image parsing, excelling in Optical Character Recognition (OCR) tasks such as text, table, and formula recognition. However, generative LVLMs, like large language models (LLMs), are prone to hallucinations, generating words that do not exist in the input image. Furthermore, LVLMs are designed for general purposes and tend to be less effective on OCR tasks than expert models trained on domain-specific datasets. In this paper, we propose DianJin-OCR-R1, a reasoning-enhanced framework that addresses these limitations by training reasoning-and-tool interleaved VLMs. Given a recognition instruction, our DianJin-OCR-R1 model first recognizes the content of the input image using its own OCR capabilities, then calls other tools (i.e., expert models) to obtain their results as references, and finally looks at the image again and rethinks its reasoning to produce the final recognized content. Since the architectures of expert models are tailored to specific OCR tasks, they are less prone to hallucinations, and their results can help VLMs mitigate hallucinations. Additionally, expert models are typically smaller in scale and easier to iterate, enabling performance improvements for VLMs at lower cost. We evaluate our model on ReST and OmniDocBench, and experimental results show that our DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, demonstrating the effectiveness of our method.
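The recognize → call-tool → rethink loop from the abstract can be sketched as plain control flow. This is only an illustrative Python sketch: the function names, the string-based stand-in "models", and the tie-breaking rule are all assumptions for exposition, not the paper's actual API or training setup.

```python
# Illustrative sketch of the reasoning-and-tool-interleaved OCR loop.
# All names below (lvlm_recognize, expert_ocr, lvlm_rethink) are
# hypothetical stand-ins, not functions from DianJin-OCR-R1.

def lvlm_recognize(image: str) -> str:
    """Stand-in for the LVLM's own OCR pass, which may hallucinate."""
    return "T0tal: $42.00"  # e.g., misreads "Total" as "T0tal"

def expert_ocr(image: str) -> str:
    """Stand-in for a specialized OCR expert model called as a tool."""
    return "Total: $42.00"

def lvlm_rethink(image: str, own: str, reference: str) -> str:
    """Stand-in for the LVLM re-examining the image with the expert
    result as a reference before emitting its final answer."""
    if own == reference:
        return own
    # In the real model this is a reasoning step over the image; here
    # we simply fall back to the trusted expert reference on conflict.
    return reference

def recognize(image: str) -> str:
    own = lvlm_recognize(image)        # step 1: LVLM's own OCR
    reference = expert_ocr(image)      # step 2: invoke expert tool
    return lvlm_rethink(image, own, reference)  # step 3: rethink, finalize

print(recognize("receipt.png"))  # the hallucinated "T0tal" is corrected
```

The key design point the sketch mirrors is that the expert output is used as a *reference* inside a second reasoning pass, rather than replacing the LVLM's answer outright.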