Meta-DAN: towards an efficient prediction strategy for page-level handwritten text recognition

📅 2025-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the slow inference speed (several seconds per page) caused by character-level autoregressive decoding in page-level handwritten text recognition, this paper proposes an efficient non-autoregressive decoding framework. Methodologically, it introduces three key innovations: (1) a novel windowed query mechanism that dynamically models both local and near-future contextual dependencies; (2) multi-token joint prediction to substantially improve decoding parallelism; and (3) meta-document-level context modeling to enhance long-range semantic consistency. Built upon an enhanced Transformer architecture, the framework achieves state-of-the-art average character error rate (CER) across ten full-page handwritten datasets. Inference time per page is reduced to sub-second latency, over 3× faster than autoregressive baselines, while preserving robust contextual modeling capability.

📝 Abstract
Recent advances in text recognition led to a paradigm shift for page-level recognition, from multi-step segmentation-based approaches to end-to-end attention-based ones. However, the naïve character-level autoregressive decoding process results in long prediction times: it requires several seconds to process a single page image on a modern GPU. We propose the Meta Document Attention Network (Meta-DAN) as a novel decoding strategy to reduce the prediction time while enabling better context modeling. It relies on two main components: windowed queries, to process several transformer queries altogether, enlarging the context modeling with the near future; and multi-token predictions, whose goal is to predict several tokens per query instead of only the next one. We evaluate the proposed approach on 10 full-page handwritten datasets and demonstrate state-of-the-art results on average in terms of character error rate. Source code and weights of trained models are available at https://github.com/FactoDeepLearning/meta_dan.
Problem

Research questions and friction points this paper is trying to address.

Reduce prediction time for page-level handwritten text recognition
Improve context modeling in end-to-end attention-based approaches
Enable multi-token predictions per query to speed up decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Windowed queries for enlarged context modeling
Multi-token predictions per query
End-to-end attention-based recognition strategy
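The speed-up comes from emitting several tokens per sequential decoder iteration: with a window of several queries, each predicting multiple tokens, one step covers many character positions instead of one. A minimal sketch of that arithmetic follows; the function names and the toy position grouping are illustrative assumptions, not the paper's released code.

```python
import math

def decoding_steps(num_tokens: int, window: int = 1, tokens_per_query: int = 1) -> int:
    """Sequential decoder iterations needed to emit a page of num_tokens characters.

    window           -- number of transformer queries processed together
    tokens_per_query -- tokens predicted by each query per iteration
    A plain character-level autoregressive decoder is window=1, tokens_per_query=1.
    """
    per_step = window * tokens_per_query
    return math.ceil(num_tokens / per_step)

def window_groups(num_tokens: int, window: int, tokens_per_query: int) -> list[list[int]]:
    """Group token positions by the decoding iteration that emits them."""
    per_step = window * tokens_per_query
    return [list(range(start, min(start + per_step, num_tokens)))
            for start in range(0, num_tokens, per_step)]

# A 1,000-character page: character-level autoregressive baseline vs. a
# hypothetical windowed, multi-token configuration.
baseline_steps = decoding_steps(1000)                               # 1000 iterations
windowed_steps = decoding_steps(1000, window=4, tokens_per_query=4)  # 63 iterations
```

With 4 queries each predicting 4 tokens, each iteration emits 16 positions, so the sequential depth drops by roughly that factor, which is where the sub-second page latency reported above comes from.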