🤖 AI Summary
To address the slow inference (several seconds per page) caused by character-level autoregressive decoding in page-level handwritten text recognition, this paper proposes an efficient non-autoregressive decoding framework. Methodologically, it introduces three key components: (1) windowed queries, which process several transformer queries jointly and enlarge context modeling to include the near future; (2) multi-token prediction, where each query predicts several tokens instead of only the next one, substantially improving decoding parallelism; and (3) document-level context modeling to enhance long-range semantic consistency. Built on an enhanced Transformer architecture, the framework achieves state-of-the-art average character error rate (CER) across ten full-page handwritten datasets. Per-page inference drops to sub-second latency, over 3× faster than autoregressive baselines, while preserving robust contextual modeling.
📝 Abstract
Recent advances in text recognition led to a paradigm shift for page-level recognition, from multi-step segmentation-based approaches to end-to-end attention-based ones. However, the naïve character-level autoregressive decoding process results in long prediction times: it requires several seconds to process a single page image on a modern GPU. We propose the Meta Document Attention Network (Meta-DAN) as a novel decoding strategy to reduce the prediction time while enabling better context modeling. It relies on two main components: windowed queries, to process several transformer queries altogether, enlarging the context modeling with the near future; and multi-token predictions, whose goal is to predict several tokens per query instead of only the next one. We evaluate the proposed approach on 10 full-page handwritten datasets and demonstrate state-of-the-art results on average in terms of character error rate. Source code and weights of trained models are available at https://github.com/FactoDeepLearning/meta_dan.
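The two components above can be sketched as a toy decoding step. This is a minimal NumPy illustration, not the paper's implementation: the window size `W`, tokens-per-query `M`, and the single linear multi-token head are all assumptions made for the sketch. It only shows the shape arithmetic of why one forward pass emits `W * M` characters, so a page of `T` characters needs roughly `ceil(T / (W * M))` decoding steps instead of `T`.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D = 100, 64  # toy vocabulary size and model width (assumptions)
W, M = 4, 2         # W windowed queries per step, M tokens per query

# Hypothetical multi-token head: one linear projection emits M token
# distributions per query instead of a single next-token distribution.
W_head = rng.standard_normal((D, M * VOCAB)) / np.sqrt(D)

def decode_step(query_states):
    """query_states: (W, D) decoder outputs for one window of queries.
    Returns (W * M,) token ids predicted jointly in a single step."""
    logits = query_states @ W_head             # (W, M * VOCAB)
    logits = logits.reshape(W, M, VOCAB)       # M distributions per query
    return logits.argmax(axis=-1).reshape(-1)  # greedy pick, flattened

# One forward pass yields W * M = 8 characters instead of 1.
tokens = decode_step(rng.standard_normal((W, D)))
T = 1000                       # characters on a page
ar_steps = T                   # character-level autoregressive decoding
meta_steps = -(-T // (W * M))  # ceil division: windowed multi-token decoding
print(len(tokens), ar_steps, meta_steps)  # 8 1000 125
```

In the real model the query states would come from a transformer decoder attending to page-image features, and later windows would condition on earlier predictions; the step-count reduction shown here is what drives the reported speed-up.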