Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems

📅 2026-02-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses language diversity, document heterogeneity, and deployment efficiency in OCR systems for Indian documents. Through the Chitrapathak series, the authors compare two training strategies: end-to-end training of a generic vision encoder paired with a multilingual language model, and fine-tuning an existing OCR model that was not originally trained on the target languages. The second strategy gives consistently better accuracy-latency trade-offs. They also introduce Parichay, a separate model series for extracting structured key fields from nine types of Indian government documents. Chitrapathak-2 achieves state-of-the-art performance on Telugu (6.69 character-level ANLS) and ranks second on the other languages, while running 3–6× faster than its predecessor. Parichay attains an Exact Match score of 89.8% on key-field extraction from government documents, with faster inference.

📝 Abstract

Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong multilingual language model and training the system end-to-end for OCR. Alternatively, we explore fine-tuning an existing OCR model, even though it was not trained for the target languages. Through extensive evaluation on multilingual Indic OCR benchmarks and deployment-oriented metrics, we find that the second strategy consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves a 3-6x speedup over its predecessor while being state-of-the-art (SOTA) in Telugu (6.69 char ANLS) and second best in the rest. In addition, we present Parichay, an independent OCR model series designed specifically for 9 Indian government documents to extract structured key fields, achieving an 89.8% Exact Match score with faster inference. Together, these systems achieve SOTA performance and provide practical guidance for building production-scale OCR pipelines in the Indian context.
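To make the Exact Match figure concrete, the following is a minimal sketch of how a field-level Exact Match score could be computed for structured key-field extraction. It is an illustration only, not the paper's evaluation code: the function name `exact_match_score`, the field names, the sample values, and the whitespace-stripping normalization are all assumptions, and the paper may score or normalize fields differently.

```python
from typing import Dict, List


def exact_match_score(
    predictions: List[Dict[str, str]],
    references: List[Dict[str, str]],
) -> float:
    """Return the fraction of reference fields whose predicted value matches exactly.

    Assumption: values are compared after stripping surrounding whitespace.
    A field missing from the prediction counts as a miss.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")

    total = 0
    correct = 0
    for pred, ref in zip(predictions, references):
        for key, expected in ref.items():
            total += 1
            predicted = pred.get(key)
            if predicted is not None and predicted.strip() == expected.strip():
                correct += 1

    # Avoid division by zero when there are no reference fields at all.
    return correct / total if total else 0.0


# Hypothetical example: one document with two key fields, one predicted wrongly.
refs = [{"name": "A. Kumar", "dob": "01/01/1990"}]
preds = [{"name": "A. Kumar", "dob": "01/01/1991"}]
score = exact_match_score(preds, refs)
print(f"Exact Match: {score:.1%}")
```

Under this definition a single wrong character in a field counts as a full miss, which is why Exact Match is a strict measure for identity-style documents where a partially correct field value is not usable.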

Problem

Research questions and friction points this paper is trying to address.

OCR

multilingual

document heterogeneity

deployment constraints

India

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual OCR

fine-tuning strategy

Vision-Language Models