Masked Next-Scale Prediction for Self-supervised Scene Text Recognition

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing self-supervised methods struggle to model the cross-scale hierarchical structure of scene text, from coarse layout down to character strokes, and are often distracted by background clutter. To address this, this work proposes MNSP (Masked Next-Scale Prediction), a unified self-supervised framework that integrates next-scale prediction with masked image reconstruction to explicitly learn the structural evolution of text across scales. A Multi-scale Linguistic Alignment module is also introduced to preserve semantic consistency throughout the hierarchy. This approach guides the model to attend to text regions and improves robustness against extreme scale variations and diverse layouts. The method achieves an average accuracy of 86.2% on Union14M and 96.7% across six standard benchmark datasets.

📝 Abstract

Scene Text Recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes. Training such models relies on large amounts of annotated data. Recent self-supervised approaches, such as Masked Image Modeling (MIM), alleviate this dependency by leveraging large-scale unlabeled data. Yet most existing MIM methods operate at a single spatial scale and fail to capture the hierarchical nature of scene text. In this work, we introduce Masked Next-Scale Prediction (MNSP), a unified self-supervised framework designed to explicitly model cross-scale structural evolution. The framework incorporates Next-Scale Prediction (NSP), which learns hierarchical representations by predicting higher-resolution features from lower-resolution contexts. Naive scale prediction, however, tends to produce spatially diffuse attention, directing the model toward background regions rather than textual structures. MNSP resolves this limitation by jointly learning cross-scale prediction and masked image reconstruction. NSP captures global layout priors across resolutions, while masked reconstruction imposes strong local constraints that guide attention toward informative text regions. A Multi-scale Linguistic Alignment module further maintains semantic consistency across different resolutions. Extensive experiments demonstrate that MNSP achieves state-of-the-art performance, reaching 86.2% average accuracy on the challenging Union14M benchmark and 96.7% across six standard datasets. Additional analyses show that our method improves robustness under extreme scale and layout variations. Code is available at https://github.com/CzhczhcHczh/MNSP
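
To make the joint objective described in the abstract more concrete, the toy sketch below combines a next-scale prediction term with a masked reconstruction term on a small grid of numbers. It is not the authors' implementation. Every name here (`build_pyramid`, `joint_loss`, `mask_ratio`, `weight`) is hypothetical, raw pixel values stand in for learned features, nearest-neighbour upsampling stands in for the learned cross-scale predictor, a visible-cell mean stands in for the learned reconstruction, and the way the two terms are weighted is an assumption rather than something the abstract specifies.

```python
import random

def avg_pool2x(grid):
    """Halve the resolution by averaging each 2x2 block."""
    h, w = len(grid), len(grid[0])
    return [[(grid[i][j] + grid[i][j + 1] + grid[i + 1][j] + grid[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)] for i in range(0, h, 2)]

def build_pyramid(image, num_scales):
    """Return scales ordered coarse -> fine; the last entry is the input.
    Height and width must be divisible by 2 ** (num_scales - 1)."""
    scales = [image]
    for _ in range(num_scales - 1):
        scales.append(avg_pool2x(scales[-1]))
    return scales[::-1]

def upsample2x(grid):
    """Nearest-neighbour upsampling: a stand-in for a learned predictor."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def mse(pred, target, mask=None):
    """Mean squared error, optionally restricted to cells where mask is True."""
    total, count = 0.0, 0
    for i, row in enumerate(target):
        for j, t in enumerate(row):
            if mask is None or mask[i][j]:
                total += (pred[i][j] - t) ** 2
                count += 1
    return total / count if count else 0.0

def random_mask(h, w, ratio, rng):
    """Boolean grid with round(ratio * h * w) cells marked as masked."""
    cells = [(i, j) for i in range(h) for j in range(w)]
    chosen = set(rng.sample(cells, round(ratio * h * w)))
    return [[(i, j) in chosen for j in range(w)] for i in range(h)]

def joint_loss(image, num_scales=3, mask_ratio=0.5, weight=1.0, seed=0):
    """Toy joint objective (assumed form): nsp + weight * mim. Needs num_scales >= 2."""
    rng = random.Random(seed)
    pyramid = build_pyramid(image, num_scales)

    # Next-scale term: predict each finer scale from the coarser one before it.
    pairs = list(zip(pyramid, pyramid[1:]))
    nsp = sum(mse(upsample2x(coarse), fine) for coarse, fine in pairs) / len(pairs)

    # Masked term: hide cells at the finest scale, fill them from the visible
    # cells only, and score the reconstruction on the hidden cells alone.
    h, w = len(image), len(image[0])
    mask = random_mask(h, w, mask_ratio, rng)
    visible = [image[i][j] for i in range(h) for j in range(w) if not mask[i][j]]
    fill = sum(visible) / len(visible) if visible else 0.0
    recon = [[fill if mask[i][j] else image[i][j] for j in range(w)] for i in range(h)]
    mim = mse(recon, image, mask)

    return nsp + weight * mim, nsp, mim
```

In this sketch the next-scale term only sees coarser context, so it rewards getting the overall layout right, while the masked term is scored on hidden cells at the finest scale and therefore penalises local errors. That mirrors the division of labour the abstract describes, but only loosely: the real method operates on learned features and attention, which this toy does not model.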

Problem

Research questions and friction points this paper is trying to address.

Scene Text Recognition

Self-supervised Learning

Masked Image Modeling

Hierarchical Representation

Cross-scale Prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Next-Scale Prediction

Self-supervised Learning

Scene Text Recognition