DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition

📅 2026-02-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the high inference latency and substantial memory consumption in conventional Transformer-based online handwritten text recognition systems, which stem from the growing key-value cache during decoding. To overcome this, the authors propose a decoder-only Retentive Network (RetNet) architecture that replaces softmax-based attention with a softmax-free retention mechanism and incorporates multi-scale sequence priors, achieving linear time and memory complexity during decoding. A layer-wise gamma scaling strategy is further introduced to preserve modeling capacity—enhancing local dependency capture in shallow layers while enabling global context modeling in deeper layers—thus compensating for the representational loss caused by removing softmax attention. Evaluated on IAM-A, RIMES, Bentham, and READ-2016 benchmarks, the model achieves character error rates of 2.26%, 1.81%, 3.46%, and 4.21%, respectively, with 1.6–1.9× faster inference and 38–42% lower memory usage.

📝 Abstract
State-of-the-art handwritten text recognition (HTR) systems commonly use Transformers, whose growing key-value (KV) cache makes decoding slow and memory-intensive. We introduce DRetHTR, a decoder-only model built on Retentive Networks (RetNet). Compared to an equally sized decoder-only Transformer baseline, DRetHTR delivers 1.6-1.9x faster inference with 38-42% less memory usage, without loss of accuracy. By replacing softmax attention with softmax-free retention and injecting multi-scale sequential priors, DRetHTR avoids a growing KV cache: decoding is linear in output length in both time and memory. To recover the local-to-global inductive bias of attention, we propose layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers. This encourages early layers to model short-range dependencies and later layers to capture broader context, mitigating the flexibility gap introduced by removing softmax. Consequently, DRetHTR achieves best reported test character error rates of 2.26% (IAM-A, en), 1.81% (RIMES, fr), and 3.46% (Bentham, en), and is competitive on READ-2016 (de) with 4.21%. This demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency.
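The abstract's core claim is that retention replaces the growing KV cache with a fixed-size state, making decoding linear in output length. The sketch below is a minimal NumPy illustration of the standard recurrent form of softmax-free retention (state update S_t = γ·S_{t-1} + kₜᵀvₜ, output oₜ = qₜS), not the authors' actual implementation; the `layerwise_gammas` schedule is a hypothetical stand-in for the paper's layer-wise gamma scaling (short retention horizon in shallow layers, near-1 decay in deep layers).

```python
import numpy as np

def retention_decode(q, k, v, gamma):
    """Recurrent (decode-time) form of softmax-free retention.

    Instead of a KV cache that grows with each generated token, a single
    d x d state matrix S is updated in place:
        S_t = gamma * S_{t-1} + outer(k_t, v_t),   o_t = q_t @ S_t.
    Per-step cost is O(d^2) in time and memory, independent of sequence
    length, so full decoding is linear in the output length.
    """
    T, d = q.shape
    S = np.zeros((d, d))            # constant-size state, replaces the KV cache
    outputs = np.empty((T, d))
    for t in range(T):
        S = gamma * S + np.outer(k[t], v[t])   # exponentially decayed update
        outputs[t] = q[t] @ S
    return outputs

def layerwise_gammas(num_layers, lo=0.85, hi=0.999):
    """Hypothetical layer-wise gamma schedule (values are illustrative):
    smaller gamma (faster decay, local dependencies) in shallow layers,
    gamma close to 1 (long retention horizon, global context) in deep ones."""
    return [lo + (hi - lo) * l / (num_layers - 1) for l in range(num_layers)]
```

The recurrent loop above is mathematically equivalent to the parallel form oₜ = Σ_{s≤t} γ^{t−s} (qₜ·kₛ) vₛ, which is what makes a decoder-only RetNet trainable in parallel yet cache-free at inference.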
Problem

Research questions and friction points this paper is trying to address.

handwritten text recognition
Transformer
KV cache
decoding efficiency
memory consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retentive Network
decoder-only
linear-time decoding
KV cache elimination
handwritten text recognition
Changhun Kim
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Martin Mayr
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Thomas Gorges
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Fei Wu
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Mathias Seuret
Friedrich-Alexander-Universität Erlangen-Nürnberg
historical document analysis · machine learning · image processing
Andreas Maier
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Vincent Christlein
University Erlangen-Nuremberg
Computer Vision · Document Analysis · Art Analysis · Computational Humanities · AI4Conservation