🤖 AI Summary
To address the core challenges in handwritten text recognition (HTR)—namely, high handwriting variability, strong contextual dependencies, and severe scarcity of annotated data—this paper pioneers a systematic exploration and adaptation of spatial-context-driven self-supervised learning (SSL). We propose a novel pretraining framework tailored for handwritten text, integrating spatial context reconstruction with local–global consistency modeling. Specifically, it combines spatial masking-based reconstruction, handwriting-aware region cropping, and contrastive positional relationship modeling, implemented via a CNN–Transformer hybrid encoder. Our approach overcomes the poor transferability of conventional SSL methods to HTR tasks. Evaluated on standard benchmarks including IAM and RIMES, it achieves an average 12.3% reduction in word error rate, establishing new state-of-the-art performance among HTR self-supervised methods and significantly reducing reliance on labeled data.
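To make the spatial masking-based reconstruction pretext concrete, here is a minimal, illustrative sketch (not the paper's implementation): a text-line image is split into non-overlapping patches and a random subset is zeroed out, producing the corrupted input that an encoder would learn to reconstruct. The function name, patch size, and masking ratio are assumptions for illustration only.

```python
import numpy as np

def mask_patches(img, patch=8, ratio=0.5, rng=None):
    """Split a (H, W) image into non-overlapping patches and zero out
    a random subset -- the corrupted input for masked reconstruction.
    Returns the masked image and a boolean mask over the patch grid."""
    rng = rng or np.random.default_rng(0)
    H, W = img.shape
    gh, gw = H // patch, W // patch          # patch-grid dimensions
    n = gh * gw
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(n * ratio), replace=False)] = True
    out = img.copy()
    for idx in np.flatnonzero(mask):
        r, c = divmod(idx, gw)               # grid cell -> pixel block
        out[r*patch:(r+1)*patch, c*patch:(c+1)*patch] = 0.0
    return out, mask.reshape(gh, gw)

line = np.ones((32, 128))                    # stand-in for a text-line image
masked, grid = mask_patches(line, patch=8, ratio=0.5)
# during pretraining, the encoder would be trained to recover `line`
# from `masked`; here 32 of the 64 patches are hidden
```

During SSL pretraining, a reconstruction loss (e.g. pixel-wise MSE on the masked patches) would drive the encoder to model the spatial context of strokes, which is the intuition the summary describes.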
📝 Abstract
Handwritten Text Recognition (HTR) is a relevant problem in computer vision that poses unique challenges owing to the inherent variability of handwriting and the rich contextualization required for its interpretation. Despite the success of Self-Supervised Learning (SSL) in computer vision, its application to HTR has been rather scattered, leaving key SSL methodologies unexplored. This work focuses on one of them: Spatial Context-based SSL. We investigate how this family of approaches can be adapted and optimized for HTR and propose new workflows that leverage the unique features of handwritten text. Our experiments demonstrate that the methods considered advance the state of the art of SSL for HTR in a number of benchmark cases.