Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

📅 2025-11-28
🤖 AI Summary
Indian language scene text recognition faces core challenges including script diversity, non-standard fonts, varying writing styles, and a scarcity of high-quality annotated data and open-source models. To address these, we introduce the Bharat Scene Text Dataset (BSTD), a large-scale, multi-task benchmark covering 11 Indian languages plus English, with more than 100K annotated words drawn from over 6,500 scene images. It supports four tasks: text detection, script identification, cropped word recognition, and end-to-end recognition. We adapt state-of-the-art models originally developed for English to Indian languages by fine-tuning them. All data and models are publicly released. The experiments show that Indian language scene text recognition remains challenging for these adapted models, which positions BSTD as a standardized benchmark for future research in multilingual scene text understanding.

📝 Abstract
Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.
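As an illustration of how the cropped word recognition task could be scored, here is a minimal sketch. The abstract does not name its evaluation metrics, so the two used here (exact-match word accuracy and an edit-distance-based character accuracy) are assumptions, and `edit_distance` and `score_words` are hypothetical helper names. Both strings are normalized to Unicode NFC before comparison, because the same Indic-script word can be encoded with different code point sequences.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def score_words(predictions, references):
    """Return (word_accuracy, char_accuracy) for paired word lists.

    Assumed metrics, not taken from the paper:
    - word_accuracy: fraction of predictions that match exactly after NFC.
    - char_accuracy: 1 - (total edit distance / total reference length),
      clipped at 0.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")

    exact = 0
    total_errors = 0
    total_chars = 0
    for pred, ref in zip(predictions, references):
        # Normalize so that equivalent encodings compare equal.
        pred = unicodedata.normalize("NFC", pred)
        ref = unicodedata.normalize("NFC", ref)
        exact += int(pred == ref)
        total_errors += edit_distance(pred, ref)
        total_chars += len(ref)

    word_acc = exact / len(references) if references else 0.0
    char_acc = max(0.0, 1.0 - total_errors / total_chars) if total_chars else 0.0
    return word_acc, char_acc


if __name__ == "__main__":
    preds = ["भारत", "STOP", "चय"]
    refs = ["भारत", "SHOP", "चाय"]
    print(score_words(preds, refs))
```

Counting errors in code points is a simplification: for Indic scripts a single visible character often spans several code points, so a grapheme-cluster-level variant may be preferable in practice.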
Problem

Research questions and friction points this paper is trying to address.

Addressing the open challenges of Indian language scene text recognition
Introducing a comprehensive dataset covering 11 Indian languages and English
Benchmarking models developed for English after adapting them to diverse Indian scripts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces BSTD, a large-scale Indian language scene text dataset with more than 100K words from over 6,500 scene images
Evaluates state-of-the-art English models fine-tuned for Indian languages
Releases the dataset and benchmark models as open source
Authors
Anik De
Indian Institute of Technology Jodhpur, Jodhpur, 342030, Rajasthan, India
Abhirama Subramanyam Penamakuri
PhD Scholar, IIT Jodhpur
Vision-Language Models, Large Language Models, CV, NLP
Rajeev Yadav
Indian Institute of Technology Jodhpur, Jodhpur, 342030, Rajasthan, India
Aditya Rathore
Indian Institute of Technology Jodhpur, Jodhpur, 342030, Rajasthan, India
Harshiv Shah
Indian Institute of Technology Jodhpur, Jodhpur, 342030, Rajasthan, India
Devesh Sharma
Indian Institute of Technology Jodhpur, Jodhpur, 342030, Rajasthan, India
Sagar Agarwal
Indian Institute of Technology Jodhpur, Jodhpur, 342030, Rajasthan, India
Pravin Kumar
MIT
Molecular Biology, Biochemistry
Anand Mishra
IIT Jodhpur
Computer Vision, Machine Learning