🤖 AI Summary
This work addresses the challenge of pretraining Transformer-based text recognition models on unlabeled data. We propose a self-supervised masked pretraining method designed for raw text-line images. Built on the masked-modeling paradigm, our approach introduces two modifications to the pretraining phase: (1) a progressive masking strategy that gradually increases the masking probability over the course of training, and (2) a dual-region loss function that jointly optimizes reconstruction of both masked and non-masked patches to enhance feature consistency. We pretrain at scale on 50 million unlabeled text-line images, without any human annotations, and fine-tune on annotated datasets of varying sizes. Our method achieves up to a 30% relative reduction in character error rate across multiple standard benchmarks, matching the performance of supervised transfer learning while drastically reducing reliance on labeled data.
📝 Abstract
Self-supervised learning has emerged as a powerful approach for leveraging large-scale unlabeled data to improve model performance across domains. In this paper, we explore masked self-supervised pre-training for text recognition transformers. Specifically, we propose two modifications to the pre-training phase: progressively increasing the masking probability, and modifying the loss function to incorporate both masked and non-masked patches. We conduct extensive experiments using a dataset of 50M unlabeled text lines for pre-training and four differently sized annotated datasets for fine-tuning. Furthermore, we compare our pre-trained models against models trained with transfer learning, demonstrating the effectiveness of self-supervised pre-training. In particular, pre-training consistently reduces the character error rate, in some cases by up to 30% relative, and performs on par with transfer learning without relying on extra annotated text lines.
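The two proposed modifications can be illustrated with a minimal sketch: a schedule that raises the masking probability over training, and a loss that combines reconstruction error on masked and non-masked patches. The linear schedule, its start/end ratios, and the weighting of the non-masked term are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def masking_ratio(epoch, total_epochs, start=0.2, end=0.6):
    # Hypothetical linear schedule for the progressive masking probability;
    # the paper's exact interpolation and endpoints are assumptions.
    t = epoch / max(total_epochs - 1, 1)
    return start + t * (end - start)

def dual_region_loss(pred, target, mask, alpha=0.5):
    # Joint reconstruction loss over masked and non-masked patch regions.
    # `mask` is 1 for masked patches, 0 otherwise; `alpha` weights the
    # non-masked term (the weighting scheme is an assumption).
    sq = (pred - target) ** 2                          # per-patch squared error
    masked = (sq * mask).sum() / max(mask.sum(), 1)
    unmasked = (sq * (1 - mask)).sum() / max((1 - mask).sum(), 1)
    return masked + alpha * unmasked
```

In a training loop, `masking_ratio(epoch, total_epochs)` would decide how many patches of each text-line image to mask before the encoder sees it, and `dual_region_loss` would replace a masked-only reconstruction objective.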