🤖 AI Summary
This study addresses the challenges of historical newspaper OCR: long texts, poor print quality, and complex layouts, on which existing Transformer-based models struggle at paragraph level because of their quadratic computational complexity. The work proposes the first application of state space models to OCR, introducing an efficient Mamba-based architecture that combines a CNN visual encoder with bidirectional and autoregressive Mamba sequence modeling and supports CTC, autoregressive, and non-autoregressive decoding. Evaluated on the National Library of Luxembourg dataset, the model achieves a character error rate of roughly 2% at line level and 6.07% at paragraph level (versus 5.24% for DAN), while running 2.05× faster and growing memory use only 1.26× on long inputs. It clearly outperforms mainstream systems such as TrOCR and Tesseract and points toward a new paradigm for large-scale cultural heritage digitization.
📝 Abstract
End-to-end OCR for historical newspapers remains challenging, as models must handle long text sequences, degraded print quality, and complex layouts. While Transformer-based recognizers dominate current research, their quadratic complexity limits efficient paragraph-level transcription and large-scale deployment. We investigate linear-time State-Space Models (SSMs), specifically Mamba, as a scalable alternative to Transformer-based sequence modeling for OCR.
We present, to our knowledge, the first OCR architecture based on SSMs, combining a CNN visual encoder with bidirectional and autoregressive Mamba sequence modeling, and conduct a large-scale benchmark comparing SSMs with Transformer- and BiLSTM-based recognizers. Multiple decoding strategies (CTC, autoregressive, and non-autoregressive) are evaluated under identical training conditions alongside strong neural baselines (VAN, DAN, DANIEL) and widely used off-the-shelf OCR engines (PERO-OCR, Tesseract OCR, TrOCR, Gemini).
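The efficiency argument rests on the SSM recurrence being linear in sequence length: each step updates a fixed-size hidden state, whereas self-attention compares every position with every other. A minimal sketch of a discretized state-space scan, reduced to a single scalar channel for clarity (the names and this simplification are ours, not the paper's architecture):

```python
def ssm_scan(inputs, a=0.9, b=0.5, c=1.0):
    """Linear-time scan of a discretized state-space model:
    h[t] = a * h[t-1] + b * x[t];  y[t] = c * h[t].
    One pass, O(T) time with O(1) state -- the property Mamba exploits
    (Mamba additionally makes a, b, c input-dependent)."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x          # fixed-size state update
        outputs.append(c * h)      # readout
    return outputs

def attention_pair_count(seq_len):
    """Pairwise-comparison count for self-attention, for contrast: O(T^2)."""
    return seq_len * seq_len

# A paragraph-length sequence of ~1000 tokens: the scan performs 1000
# state updates, while attention performs 1,000,000 pairwise comparisons.
print(len(ssm_scan([1.0] * 1000)), attention_pair_count(1000))
```

This gap is why paragraph-level transcription, where sequences are an order of magnitude longer than single lines, favors the SSM formulation.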
Experiments on historical newspapers from the Bibliothèque nationale du Luxembourg, with newly released gold-standard annotations verified to >99% accuracy, and cross-dataset tests on Fraktur and Antiqua lines, show that all neural models achieve low error rates (~2% CER), making computational efficiency the main differentiator. Mamba-based models maintain competitive accuracy while halving inference time and exhibiting superior memory scaling (1.26× vs. 2.30× growth at 1,000 characters), and reach 6.07% CER on severely degraded paragraph-level input, compared to 5.24% for DAN, while remaining 2.05× faster.
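The CER figures above are the standard Levenshtein-based character error rate: edit distance between hypothesis and reference, divided by reference length. A minimal reference implementation for illustration (ours, not the paper's evaluation code):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance (substitutions,
    insertions, deletions) divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m else float(n > 0)

# One wrong character in a 50-character line is a 2% CER, the level
# reported for the neural systems on clean line-level input.
print(round(cer("a" * 49 + "b", "a" * 50), 2))  # → 0.02
```

Note that CER can exceed 1.0 on very noisy output, since insertions are counted against the reference length.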
We release code, trained models, and standardized evaluation protocols to enable reproducible research and guide practitioners in large-scale cultural heritage OCR.