🤖 AI Summary
This work proposes a visual question answering (VQA)-based data augmentation framework for optical character recognition (OCR), addressing the limitation of conventional OCR models that directly predict transcriptions without fine-grained reasoning about text structure. For the first time, structured VQA tasks are introduced into both scene and handwritten text recognition by automatically generating natural language questions concerning character existence, position, and frequency, with answers derived from ground-truth transcriptions. This approach encourages the model to jointly reason over visual and semantic information through multi-task learning, thereby enhancing character-level understanding and visual-text alignment. Experimental results demonstrate significant reductions in character error rate (CER) and word error rate (WER) on the WordArt and Esposalles datasets, outperforming existing baselines.
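The question-generation step described above can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's actual implementation: the question templates, the three task types (presence, position, frequency), and the function name `generate_qa_pairs` are assumptions based on the summary, and the real framework may phrase or sample questions differently.

```python
import random

def generate_qa_pairs(transcription: str, num_questions: int = 3):
    """Generate character-level QA pairs from a ground-truth transcription.

    The three question types (presence, position, frequency) follow the
    summary's description; exact templates are hypothetical.
    """
    qa_pairs = []
    chars = list(transcription)
    for _ in range(num_questions):
        kind = random.choice(["presence", "position", "frequency"])
        c = random.choice(chars)
        if kind == "presence":
            # Character existence: answer derived directly from the label text
            qa_pairs.append((f"Does the character '{c}' appear in the text?", "yes"))
        elif kind == "position":
            # 1-based position of the first occurrence
            idx = transcription.index(c) + 1
            qa_pairs.append((f"At which position does '{c}' first appear?", str(idx)))
        else:
            # Character frequency
            qa_pairs.append((f"How many times does '{c}' occur?", str(transcription.count(c))))
    return qa_pairs
```

Each generated (question, answer) pair would then be paired with the original image as an auxiliary VQA training sample alongside the standard transcription objective.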
📝 Abstract
Scene text recognition (STR) and handwritten text recognition (HTR) face significant challenges in accurately transcribing textual content from images into machine-readable formats. Conventional OCR models often predict transcriptions directly, which limits detailed reasoning about text structure. We propose a VQA-inspired data augmentation framework that strengthens OCR training through structured question-answering tasks. For each image-text pair, we generate natural-language questions probing character-level attributes such as presence, position, and frequency, with answers derived from the ground-truth text. These auxiliary tasks encourage finer-grained reasoning, requiring the OCR model to align visual features with textual queries and to reason jointly over images and questions. Experiments on the WordArt and Esposalles datasets show consistent improvements over baseline models, with significant reductions in both CER and WER. Our code is publicly available at https://github.com/xuyaooo/DataAugOCR.