🤖 AI Summary
This work addresses the limited self-correction capability of current large vision-language models during multi-turn refinement, which often leads to repetitive and ineffective attempts. The authors propose a training-free iterative self-correction framework that introduces, for the first time, dual mechanisms of capability reflection and memory reflection. Capability reflection diagnoses errors and formulates correction strategies, while memory reflection reviews past attempts to avoid redundancy, enabling rigorous re-reasoning to refine outputs. This approach yields an intelligent OCR agent endowed with structured introspective abilities. Evaluated on OCRBench v2, the method surpasses the open-source state-of-the-art model InternVL3-8B by +2.0 on the English subset and +1.2 on the Chinese subset, and achieves leading performance on visual understanding (79.9) and reasoning (66.5) tasks.
📝 Abstract
Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally, optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on the English subset and +1.2 on the Chinese subset, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.
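The refinement loop described in the abstract can be sketched in pseudocode form. This is a minimal illustration, not the authors' implementation: the functions `diagnose` and `re_reason` stand in for VLM calls (here replaced by toy stubs), and the history list plays the role of the Memory Reflection store.

```python
def diagnose(question, answer):
    """Capability Reflection (toy stub): return a correction plan,
    or None when no error is detected in the current answer."""
    return None if answer == "42" else "recheck arithmetic"

def re_reason(question, answer, plan):
    """Rigorous re-reasoning (toy stub): revise the answer under the plan.
    In the real framework this would be another VLM inference call."""
    return "42"

def self_correct(question, answer, max_turns=3):
    history = []  # Memory Reflection store: (answer, plan) attempts already made
    for _ in range(max_turns):
        plan = diagnose(question, answer)   # Capability Reflection
        if plan is None:
            break                           # answer judged correct; stop early
        if (answer, plan) in history:
            break                           # Memory Reflection: refuse to repeat an identical attempt
        history.append((answer, plan))
        answer = re_reason(question, answer, plan)
    return answer
```

The key structural point is that the loop terminates either when diagnosis finds no error or when the only available correction has already been tried, which is how the framework avoids the repetitive, ineffective revisions the abstract describes.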