VISTA-OCR: Towards generative and interactive end to end OCR models

📅 2025-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional OCR suffers from fragmented detection and recognition pipelines, Vision Large Language Models (VLLMs) are costly to deploy, and neither offers much support for content-driven localization or interactive tasks. To address these challenges, we propose VISTA-OCR, a lightweight, end-to-end generative OCR model. Methodologically, it introduces (1) new prompt-controllable OCR tasks during pre-training; (2) a unified single-branch Transformer decoder that jointly generates textual content and spatial coordinates; and (3) interactive recognition of both printed and handwritten text. Trained via multi-stage progressive learning, multimodal token modeling, and a synthetic–real hybrid dataset, VISTA-OCR outperforms specialized state-of-the-art models on standard OCR tasks across multiple datasets. Its 150M-parameter variant, VISTAₒₘₙᵢ, avoids the computational cost of VLLMs while handling handwritten and printed documents interactively through prompting.

📝 Abstract
We introduce VISTA-OCR (Vision and Spatially-aware Text Analysis OCR), a lightweight architecture that unifies text detection and recognition within a single generative model. Unlike conventional methods that require separate branches with dedicated parameters for text recognition and detection, our approach leverages a Transformer decoder to sequentially generate text transcriptions and their spatial coordinates in a unified branch. Built on an encoder-decoder architecture, VISTA-OCR is progressively trained, starting with the visual feature extraction phase, followed by multitask learning with multimodal token generation. To address the increasing demand for versatile OCR systems capable of advanced tasks, such as content-based text localization, we introduce new prompt-controllable OCR tasks during pre-training. To enhance the model's capabilities, we built a new dataset composed of real-world examples enriched with bounding box annotations and synthetic samples. Although recent Vision Large Language Models (VLLMs) can efficiently perform these tasks, their high computational cost remains a barrier for practical deployment. In contrast, our VISTAₒₘₙᵢ variant processes both handwritten and printed documents with only 150M parameters, interactively, by prompting. Extensive experiments on multiple datasets demonstrate that VISTA-OCR achieves better performance compared to state-of-the-art specialized models on standard OCR tasks while showing strong potential for more sophisticated OCR applications, addressing the growing need for interactive OCR systems. All code and annotations for VISTA-OCR will be made publicly available upon acceptance.
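To make the idea of generating transcriptions and coordinates "in a unified branch" more concrete, here is a minimal Python sketch of one way text and box coordinates could share a single token sequence. It is an illustration under stated assumptions, not the authors' tokenization: the `<loc_k>` token format, the 1000-bin quantization, the `Word` container, and the one-token-per-word simplification are all hypothetical (a real model would use subword tokens).

```python
from dataclasses import dataclass

NUM_BINS = 1000  # assumed quantization resolution, not taken from the paper


@dataclass
class Word:
    text: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def quantize(value, size, bins=NUM_BINS):
    """Map a pixel coordinate to an integer bin in [0, bins - 1]."""
    return min(bins - 1, max(0, int(value * bins // size)))


def encode(words, width, height):
    """Interleave each transcription with four hypothetical location tokens."""
    tokens = []
    for w in words:
        x0, y0, x1, y1 = w.box
        tokens.append(w.text)  # simplification: one token per word
        tokens += [
            f"<loc_{quantize(x0, width)}>",
            f"<loc_{quantize(y0, height)}>",
            f"<loc_{quantize(x1, width)}>",
            f"<loc_{quantize(y1, height)}>",
        ]
    return tokens


def decode(tokens, width, height, bins=NUM_BINS):
    """Invert encode(): recover (text, box) pairs at bin resolution."""
    words = []
    sizes = (width, height, width, height)
    for i in range(0, len(tokens), 5):
        text, locs = tokens[i], tokens[i + 1:i + 5]
        bin_ids = [int(t[len("<loc_"):-1]) for t in locs]
        box = tuple(b * s / bins for b, s in zip(bin_ids, sizes))
        words.append(Word(text, box))
    return words


page = [Word("Invoice", (100, 50, 300, 90))]
sequence = encode(page, width=1000, height=500)
restored = decode(sequence, width=1000, height=500)
```

Because text and location tokens live in one vocabulary, a single decoder can emit both, which is the property the abstract contrasts with separate detection and recognition branches.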
Problem

Research questions and friction points this paper is trying to address.

Unifies text detection and recognition in one generative model
Addresses high computational cost of Vision Large Language Models
Enables interactive OCR with prompt-controllable tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified generative model for text detection and recognition
Transformer decoder generates text and spatial coordinates
Prompt-controllable OCR tasks with lightweight architecture
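
As a rough illustration of what "prompt-controllable" could mean in practice, the Python sketch below builds a decoder prefix that selects a task and, for content-based localization, carries the text to look for. The task names, the special tokens (`<s>`, `<ocr>`, `<ocr_box>`, `<locate>`, `</locate>`), and the `build_prompt` helper are assumptions made for this example; the page does not specify the paper's actual prompt format.

```python
# Hypothetical task-to-token mapping; not the paper's vocabulary.
TASK_TOKENS = {
    "ocr": "<ocr>",                # plain transcription
    "ocr_with_boxes": "<ocr_box>", # transcription plus coordinates
    "locate": "<locate>",          # content-based text localization
}


def build_prompt(task, query=None):
    """Return the token prefix that would condition the decoder on a task."""
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    prompt = ["<s>", TASK_TOKENS[task]]
    if task == "locate":
        if not query:
            raise ValueError("the 'locate' task needs a text query")
        # The query text is placed inside the prompt so the model can
        # answer with the coordinates of the matching region.
        prompt += query.split() + ["</locate>"]
    return prompt


full_page = build_prompt("ocr_with_boxes")
find_total = build_prompt("locate", "total amount")
```

Under this assumed scheme, the same weights serve every task, and only the prefix changes between requests.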
Laziz Hamdi
LITIS, Rouen, Normandie
Laziz Hamdi
Malakoff Humanis, Paris
Amine Tamasna
Malakoff Humanis, Paris
Pascal Boisson
Malakoff Humanis, Paris
Thierry Paquet
University of Rouen Normandy, LITIS
Machine Learning, Handwriting Recognition, Reading Systems, Document Image Analysis