AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation

📅 2026-02-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing visual multimodal RAG systems process entire document pages as input, often suffering from attention dilution and hallucination due to contextual redundancy and aggressive visual token compression. This work proposes a dynamic OCR parsing paradigm that, for the first time, models OCR as a query-driven, agent-like on-demand parsing mechanism. By leveraging “visual thinking” to autonomously analyze document layout, the system selectively recognizes only regions relevant to the query and integrates a dynamic visual token decompression technique to enable fine-grained text extraction. This approach decouples retrieval granularity from fixed page-level chunking, establishing dynamic OCR as the third core module in visual RAG—complementing embedding and re-ranking—and achieves expert-level performance by significantly improving both accuracy and efficiency on long-document understanding tasks.

📝 Abstract

The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge of processing complex visual documents, such as financial reports. While page-level chunking and retrieval are a natural starting point, they create a critical bottleneck: delivering entire pages to the generator introduces excessive extraneous context. This not only overloads the generator's attention mechanism but also dilutes the most salient evidence. Moreover, compressing these information-rich pages into a limited visual token budget further increases the risk of hallucinations. To address this, we introduce AgenticOCR, a dynamic parsing paradigm that transforms optical character recognition (OCR) from a static, full-text process into a query-driven, on-demand extraction system. By autonomously analyzing document layout in a "thinking with images" manner, AgenticOCR identifies and selectively recognizes regions of interest. This approach performs on-demand decompression of visual tokens precisely where needed, effectively decoupling retrieval granularity from rigid page-level chunking. AgenticOCR has the potential to serve as the "third building block" of the visual document RAG stack, operating alongside and enhancing standard Embedding and Reranking modules. Experimental results demonstrate that AgenticOCR improves both the efficiency and accuracy of visual RAG systems, achieving expert-level performance in long document understanding. Code and models are available at https://github.com/OpenDataLab/AgenticOCR.
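To make the idea of query-driven, on-demand parsing concrete, here is a minimal toy sketch in Python. It is not the AgenticOCR implementation or its API: the `Region` type, the `relevance` score (plain keyword overlap), and `parse_on_demand` are all hypothetical names invented for illustration, and the `full_text` field merely stands in for what a real OCR pass over that region would return. The sketch only shows the control flow the abstract describes: inspect the page layout cheaply, pick the regions that look relevant to the query, and run full recognition on those regions alone.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A layout region on a page (hypothetical structure, for illustration only)."""
    label: str      # coarse layout label, e.g. "table" or "paragraph"
    preview: str    # cheap hint such as a caption or heading
    full_text: str  # stand-in for the result of a real OCR pass on this region


def relevance(query: str, region: Region) -> int:
    """Toy relevance score: number of query words found in the region's cheap hints."""
    query_words = set(query.lower().split())
    hint_words = set(f"{region.label} {region.preview}".lower().split())
    return len(query_words & hint_words)


def parse_on_demand(query: str, regions: list[Region], top_k: int = 1) -> list[str]:
    """Recognize only the top_k regions that overlap with the query at all."""
    ranked = sorted(regions, key=lambda r: relevance(query, r), reverse=True)
    selected = [r for r in ranked[:top_k] if relevance(query, r) > 0]
    # Only the selected regions are "read"; the rest of the page is never parsed.
    return [r.full_text for r in selected]


page = [
    Region("paragraph", "chairman letter to shareholders", "Dear shareholders, ..."),
    Region("table", "quarterly revenue by segment", "Q1 revenue: 10; Q2 revenue: 12"),
    Region("figure", "office locations map", "[map]"),
]

result = parse_on_demand("what was the quarterly revenue", page)
print(result)
```

In this toy setup only the table region is passed to the generator, while the letter and the map are skipped. A real system would replace the keyword overlap with a model that reasons over the page image, which is the part the paper contributes.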

Problem

Research questions and friction points this paper is trying to address.

retrieval-augmented generation

visual documents

OCR

hallucination

multimodal RAG

Innovation

Methods, ideas, or system contributions that make the work stand out.

AgenticOCR

retrieval-augmented generation

on-demand OCR