Masked Next-Scale Prediction for Self-supervised Scene Text Recognition

📅 2026-05-14
📈 Citations: 0
Influential: 0
📄 PDF

career value

193K/year
🤖 AI Summary
Existing self-supervised methods struggle to model the cross-scale hierarchical structure of scene text—from layout to character strokes—and are often distracted by background clutter. To address this, this work proposes MNSP, a unified self-supervised framework that, for the first time, integrates next-scale prediction with masked image reconstruction to explicitly learn the structural evolution of text across scales. Furthermore, a multi-scale language alignment module is introduced to preserve semantic consistency throughout the hierarchy. This approach effectively guides the model to attend to text regions, significantly enhancing robustness against extreme scale variations and diverse layouts. The method achieves an average accuracy of 86.2% on Union14M and 96.7% across six standard benchmark datasets.
📝 Abstract
Scene Text Recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes. Training such models relies on large amounts of annotated data. Recent self-supervised approaches, such as Masked Image Modeling (MIM), alleviate this dependency by leveraging large-scale unlabeled data. Yet most existing MIM methods operate at a single spatial scale and fail to capture the hierarchical nature of scene text. In this work, we introduce Masked Next-Scale Prediction (MNSP), a unified self-supervised framework designed to explicitly model cross-scale structural evolution. The framework incorporates Next-Scale Prediction (NSP), which learns hierarchical representations by predicting higher-resolution features from lower-resolution contexts. Naive scale prediction, however, tends to produce spatially diffuse attention, directing the model toward background regions rather than textual structures. MNSP resolves this limitation by jointly learning cross-scale prediction and masked image reconstruction. NSP captures global layout priors across resolutions, while masked reconstruction imposes strong local constraints that guide attention toward informative text regions. A Multi-scale Linguistic Alignment module further maintains semantic consistency across different resolutions. Extensive experiments demonstrate that MNSP achieves state-of-the-art performance, reaching 86.2\% average accuracy on the challenging Union14M benchmark and 96.7\% across six standard datasets. Additional analyses show that our method improves robustness under extreme scale and layout variations. Code is available at https://github.com/CzhczhcHczh/MNSP
Problem

Research questions and friction points this paper is trying to address.

Scene Text Recognition
Self-supervised Learning
Masked Image Modeling
Hierarchical Representation
Cross-scale Prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Next-Scale Prediction
Self-supervised Learning
Scene Text Recognition
Cross-scale Modeling
Multi-scale Linguistic Alignment
🔎 Similar Papers
No similar papers found.