🤖 AI Summary
This study investigates the mechanism behind cross-lingual transfer in low-resource handwritten text recognition (HTR), asking whether its gains are better explained by shared visual representations or by sequence modeling. In controlled experiments on Arabic (KHATT), Urdu (NUST-UHWR), and Persian (PHTD) line-level data, the authors compare CNN-only models with CTC decoding against full CRNN models under identical single-script and multi-script training regimes. In low-resource settings with 100, 500, and 1,000 training samples, CNN-only models show limited or unstable gains from multi-script training, whereas CRNN models improve, most visibly in the most data-constrained regimes. Measured as the change in character error rate (delta CER) rather than absolute performance, the results associate cross-lingual transfer with sequence-level modeling and suggest that visual similarity of character shapes alone is insufficient.
📝 Abstract
Handwritten Text Recognition (HTR) for Arabic-script languages benefits from cross-language joint training under low-resource conditions, particularly when using CRNN-based models that combine convolutional encoders with sequence modeling. However, it remains unclear whether these improvements are better explained by shared visual representations or sequence-level dependencies. In this work, we conduct a controlled architectural study of line-level Arabic-script HTR, comparing CNN-only models with CTC decoding and CRNN models under identical single-script and multi-script training regimes. Experiments are performed on Arabic (KHATT), Urdu (NUST-UHWR), and Persian (PHTD) datasets under low-resource settings (K in {100, 500, 1000}). Our results show a clear divergence in transfer behavior: while CNN-only models exhibit limited or unstable improvements, CRNN models achieve better performance under multi-script training, particularly in the most data-constrained regimes. Focusing on transfer improvements (delta CER) rather than absolute performance, we find that cross-language improvements are associated with sequence-level modeling, whereas sharing the visual representations learned by the CNN encoder, which correspond to similarities in character shapes across scripts, appears to be insufficient on its own. This finding suggests that contextual modeling plays an important role in enabling effective transfer in low-resource scenarios, and that similar behavior may extend to other low-resource language settings.
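The transfer metric mentioned above (delta CER) can be illustrated with a minimal Python sketch. The abstract does not give a formula, so everything here is an assumption: CER is taken to be the total character-level edit distance divided by the total number of reference characters, and delta CER is taken to be single-script CER minus multi-script CER, so that a positive value means multi-script training helped. The function names `edit_distance`, `cer`, and `delta_cer` are hypothetical and are not from the paper.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total edits / total reference characters (assumed definition)."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars


def delta_cer(cer_single: float, cer_multi: float) -> float:
    """Assumed sign convention: positive means multi-script training lowered CER."""
    return cer_single - cer_multi


if __name__ == "__main__":
    # Toy strings for illustration only; these are not results from the paper.
    refs = ["abcd", "efgh"]
    hyps_single = ["abxd", "efg"]   # hypothetical single-script model output
    hyps_multi = ["abcd", "efg"]    # hypothetical multi-script model output
    print(delta_cer(cer(refs, hyps_single), cer(refs, hyps_multi)))
```

Comparing this quantity for a CNN-only model and a CRNN model at the same K is one way to separate transfer gains from absolute accuracy. For Arabic-script text, the result depends on how strings are Unicode-normalized before comparison, which this sketch leaves to the caller.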