🤖 AI Summary
This work addresses the challenge of pretraining Transformer-based text recognition models on unlabeled data. We propose a self-supervised masked pretraining method designed for raw text-line images. Built on the masked-modeling paradigm, our approach introduces two modifications to the pretraining phase: (1) a progressive masking strategy that gradually increases the masking probability over the course of training, and (2) a dual-region loss function that jointly optimizes reconstruction of both masked and non-masked patches to enhance feature consistency. We pretrain at scale on 50 million unlabeled text-line images, without any human annotations, and fine-tune on annotated datasets of varying sizes. Our method achieves up to a 30% relative reduction in character error rate across multiple standard benchmarks, matching the performance of supervised transfer learning while drastically reducing reliance on labeled data.
📝 Abstract
Self-supervised learning has emerged as a powerful approach for leveraging large-scale unlabeled data to improve model performance across domains. In this paper, we explore masked self-supervised pre-training for text recognition transformers. Specifically, we propose two modifications to the pre-training phase: progressively increasing the masking probability, and modifying the loss function to incorporate both masked and non-masked patches. We conduct extensive experiments using a dataset of 50M unlabeled text lines for pre-training and four differently sized annotated datasets for fine-tuning. Furthermore, we compare our pre-trained models against models trained with transfer learning, demonstrating the effectiveness of self-supervised pre-training. In particular, pre-training consistently reduces the character error rate, in some cases by up to 30% relative, and performs on par with transfer learning without relying on extra annotated text lines.
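The two proposed modifications can be illustrated with a minimal sketch: a schedule that raises the masking probability over training, and a loss that combines reconstruction error on masked and non-masked patches. The linear schedule, its start/end ratios, and the weighting of the non-masked term are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def masking_ratio(epoch, total_epochs, start=0.2, end=0.6):
    # Hypothetical linear schedule for the progressive masking probability;
    # the paper's exact interpolation and endpoints are assumptions.
    t = epoch / max(total_epochs - 1, 1)
    return start + t * (end - start)

def dual_region_loss(pred, target, mask, alpha=0.5):
    # Joint reconstruction loss over masked and non-masked patch regions.
    # `mask` is 1 for masked patches, 0 otherwise; `alpha` weights the
    # non-masked term (the weighting scheme is an assumption).
    sq = (pred - target) ** 2                          # per-patch squared error
    masked = (sq * mask).sum() / max(mask.sum(), 1)
    unmasked = (sq * (1 - mask)).sum() / max((1 - mask).sum(), 1)
    return masked + alpha * unmasked
```

In a training loop, `masking_ratio(epoch, total_epochs)` would decide how many patches of each text-line image to mask before the encoder sees it, and `dual_region_loss` would replace a masked-only reconstruction objective.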