OCR-Agent: Agentic OCR with Capability and Memory Reflection

📅 2026-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited self-correction capability of current large vision-language models during multi-turn refinement, which often leads to repetitive and ineffective attempts. The authors propose a training-free iterative self-correction framework that introduces, for the first time, dual mechanisms of capability reflection and memory reflection. Capability reflection diagnoses errors and formulates correction strategies, while memory reflection reviews past attempts to avoid redundancy, enabling rigorous re-reasoning to refine outputs. This approach yields an intelligent OCR agent endowed with structured introspective abilities. Evaluated on OCRBench v2, the method surpasses the open-source state-of-the-art model InternVL3-8B by +2.0 in English and +1.2 in Chinese, and achieves leading performance on visual understanding (79.9) and reasoning (66.5) tasks.

📝 Abstract
Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally, optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.
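The reflection loop the abstract describes (diagnose errors, check past attempts, re-reason) can be sketched in a few lines of Python. This is a minimal, hypothetical sketch, not the authors' implementation: `query_vlm` is a toy stand-in for a real vision-language model call, and all function names and prompts are assumptions for illustration.

```python
def query_vlm(prompt: str) -> str:
    """Toy stand-in for a VLM call; a real agent would send the
    prompt (plus the document image) to a vision-language model."""
    return "draft answer" if "ANSWER" in prompt else "retry: " + str(hash(prompt) % 100)

def capability_reflection(answer: str) -> str:
    """Capability Reflection: diagnose likely errors in the current
    answer and formulate a correction plan (here, a fixed template)."""
    return f"Check character-level OCR errors in: {answer!r}"

def memory_reflection(candidate: str, history: list[str]) -> bool:
    """Memory Reflection: return True if the candidate merely repeats
    a past attempt and should be discarded."""
    return candidate in history

def ocr_agent(question: str, max_turns: int = 3) -> str:
    """Training-free iterative self-correction loop."""
    history: list[str] = []
    answer = query_vlm(f"ANSWER: {question}")      # initial draft
    for turn in range(max_turns):
        plan = capability_reflection(answer)       # 1) diagnose + plan
        candidate = query_vlm(f"{plan} | turn {turn}")  # 2) rigorous re-reasoning
        if memory_reflection(candidate, history):  # 3) skip repeated attempts
            continue
        history.append(candidate)
        answer = candidate
    return answer
```

Because the loop is training-free, it wraps any off-the-shelf VLM: only the two reflection prompts and the answer history are added on top of the base model.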
Problem

Research questions and friction points this paper is trying to address.

self-correction
cognitive bias
visual understanding
iterative refinement
large vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Capability Reflection
Memory Reflection
Iterative Self-Correction
Vision-Language Models
OCR-Agent
Authors

Shimin Wen (Southwest Minzu University)
Zeyu Zhang (AI Geeks)
Xingdou Bian (Southwest Minzu University)
Hongjie Zhu (BeiGene)
Lulu He (Southwest Minzu University)
Layi Shama (Southwest Minzu University)
Daji Ergu (Southwest Minzu University)
Ying Cai (Associate Professor, Department of Computer Science, Iowa State University)