Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies

📅 2025-05-11
📈 Citations: 0
Influential Citations: 0
🤖 AI Summary
To address the performance degradation in text recognition caused by the domain shift between synthetic and real text images, this paper proposes a self-supervised representation learning framework driven by multiple masking strategies. The method introduces, for the first time, both blockwise and span masking into masked modeling of text images to explicitly capture character-level contextual relationships. Within a Masked Autoencoder (MAE) architecture, it jointly optimizes contrastive learning and masked image modeling (MIM), fusing low-level texture features with high-level semantic features. Extensive experiments demonstrate that the proposed approach significantly outperforms existing self-supervised methods on downstream tasks, including text recognition, text segmentation, and text image super-resolution, and achieves state-of-the-art (SOTA) performance across multiple real-world benchmarks.
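
The joint objective described above (MIM reconstruction plus contrastive learning inside an MAE-style encoder) can be written compactly. The PyTorch sketch below is only illustrative: the function name, the loss weight alpha, and the NT-Xent form of the contrastive term are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def mms_joint_loss(pred_patches, target_patches, mask, z_a, z_b, temperature=0.07, alpha=1.0):
    """Hypothetical joint MIM + contrastive objective (illustrative only).

    pred_patches, target_patches: (B, N, D) reconstructed vs. ground-truth patch pixels
    mask: (B, N) float mask, 1.0 where a patch was masked (MAE-style loss on masked patches only)
    z_a, z_b: (B, C) projected features of two views for the contrastive term
    """
    # MIM term: mean squared error averaged over masked patches only
    rec = ((pred_patches - target_patches) ** 2).mean(dim=-1)      # (B, N)
    loss_mim = (rec * mask).sum() / mask.sum().clamp(min=1.0)

    # Contrastive term: InfoNCE / NT-Xent between matching views in the batch
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                           # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)          # positives on the diagonal
    loss_con = F.cross_entropy(logits, labels)

    return loss_mim + alpha * loss_con
```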

📝 Abstract
Most existing text recognition methods are trained on large-scale synthetic datasets due to the scarcity of labeled real-world datasets. Synthetic images, however, cannot faithfully reproduce real-world scenarios, such as uneven illumination, irregular layout, occlusion, and degradation, resulting in performance disparities when handling complex real-world images. Recent self-supervised learning techniques, notably contrastive learning and masked image modeling (MIM), narrow this domain gap by exploiting unlabeled real text images. This study first analyzes the original Masked AutoEncoder (MAE) and observes that random patch masking predominantly captures low-level textural features but misses high-level contextual representations. To fully exploit high-level contextual representations, we introduce random blockwise and span masking into the text recognition task. These strategies mask contiguous image patches and can completely remove some characters, forcing the model to infer relationships among characters within a word. Our Multi-Masking Strategy (MMS) integrates random patch, blockwise, and span masking into the MIM framework, which jointly learns low- and high-level textual representations. After fine-tuning with real data, MMS outperforms state-of-the-art self-supervised methods in various text-related tasks, including text recognition, segmentation, and text-image super-resolution.
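
As a concrete illustration of how contiguous masks differ from MAE's independent random patches, the NumPy sketch below generates span masks (contiguous runs along the reading direction) and blockwise masks (square regions on the 2-D patch grid). Function names, span and block sizes, and the stopping rule are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def span_mask(num_patches, mask_ratio=0.5, max_span=4, rng=None):
    """Mask contiguous runs (spans) of patches along the reading direction,
    so whole characters can disappear and must be inferred from context."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(num_patches, dtype=bool)
    target = int(num_patches * mask_ratio)
    while mask.sum() < target:
        span = int(rng.integers(1, max_span + 1))            # length of the masked run
        start = int(rng.integers(0, num_patches - span + 1))
        mask[start:start + span] = True
    return mask  # True = patch is masked

def blockwise_mask(grid_h, grid_w, mask_ratio=0.5, block=2, rng=None):
    """Mask square blocks on the 2-D patch grid, removing contiguous regions
    rather than isolated patches."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(grid_h * grid_w * mask_ratio)
    while mask.sum() < target:
        r = int(rng.integers(0, grid_h - block + 1))
        c = int(rng.integers(0, grid_w - block + 1))
        mask[r:r + block, c:c + block] = True
    return mask

# Example: a word image split into 32 horizontal patches, half removed in spans
print(span_mask(32, mask_ratio=0.5))
```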
Problem

Research questions and friction points this paper is trying to address.

Bridging performance gap between synthetic and real-world text images
Enhancing high-level contextual learning in text recognition
Integrating multiple masking strategies for improved textual representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces random blockwise and span masking
Integrates multiple masking strategies in MIM (see the sketch after this list)
Jointly learns low and high-level textual representations
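
A multi-masking setup also needs a way to combine the three strategies. The sketch below builds on the hypothetical span_mask and blockwise_mask helpers from the earlier snippet and draws one strategy per training sample; the uniform choice is an assumption, since the paper may weight or schedule the strategies differently.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.75, rng=None):
    """Vanilla MAE-style masking: independently chosen random patches."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(num_patches, dtype=bool)
    idx = rng.choice(num_patches, size=int(num_patches * mask_ratio), replace=False)
    mask[idx] = True
    return mask

def sample_mask(grid_h, grid_w, rng=None):
    """Hypothetical per-sample scheduler: pick one of the three masking strategies."""
    rng = rng or np.random.default_rng()
    strategy = rng.choice(["patch", "block", "span"])
    if strategy == "patch":
        return random_patch_mask(grid_h * grid_w, rng=rng).reshape(grid_h, grid_w)
    if strategy == "block":
        return blockwise_mask(grid_h, grid_w, rng=rng)   # defined in the earlier sketch
    # span masking over the flattened patch order (approximates the reading direction)
    return span_mask(grid_h * grid_w, rng=rng).reshape(grid_h, grid_w)
```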
Zhengmi Tang
Wenzhou University Artificial Intelligence and Advanced Manufacturing Institute (AIAMI), Wenzhou City, China
Yuto Mitsui
Graduate School of Engineering, Tohoku University, Sendai, Japan
Tomo Miyazaki
Graduate School of Engineering, Tohoku University, Sendai, Japan
Shinichiro Omachi
Professor of Engineering, Tohoku University
pattern recognition, image processing, machine learning