Benchmarking Large Language Models for Handwritten Text Recognition

📅 2025-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates the zero-shot generalization capability of multimodal large language models (MLLMs) for handwritten text recognition (HTR), focusing on cross-lingual (English, French, German, Italian) and cross-temporal (modern vs. historical) document domains. Methodologically, we adopt a unified zero-shot prompting framework to benchmark leading MLLMs—including Claude 3.5 Sonnet—against the supervised model Transkribus, and introduce an output self-correction mechanism to assess error correction capacity. Key contributions include: (1) the first zero-shot MLLM-HTR benchmark spanning multiple languages and eras; (2) empirical findings that MLLMs achieve relatively high accuracy on modern English handwriting but exhibit strong language bias and marked performance degradation on historical scripts; (3) weak and inconsistent self-correction ability across models; and (4) no systematic superiority over Transkribus, challenging the assumption that MLLMs can readily replace supervised HTR systems.
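HTR benchmarks like this one are conventionally scored with character error rate (CER), i.e. edit distance between the model transcription and the ground truth, normalized by reference length. The summary does not name the metric, so treat this as a standard-practice assumption; a minimal sketch:

```python
def levenshtein(ref: str, hyp: str) -> int:
    # Classic dynamic-programming edit distance over characters
    # (insertions, deletions, substitutions all cost 1).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: edit distance normalized by reference length.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)
```

For example, `cer("hello", "hallo")` is 0.2 (one substitution over five reference characters); lower is better, and values above 1.0 are possible for very noisy transcriptions.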

📝 Abstract
Traditional machine learning models for Handwritten Text Recognition (HTR) rely on supervised training, requiring extensive manual annotations, and often produce errors due to the separation between layout and text processing. In contrast, Multimodal Large Language Models (MLLMs) offer a general approach to recognizing diverse handwriting styles without the need for model-specific training. The study benchmarks various proprietary and open-source LLMs against Transkribus models, evaluating their performance on both modern and historical datasets written in English, French, German, and Italian. In addition, emphasis is placed on testing the models' ability to autonomously correct previously generated outputs. Findings indicate that proprietary models, especially Claude 3.5 Sonnet, outperform open-source alternatives in zero-shot settings. MLLMs achieve excellent results in recognizing modern handwriting and exhibit a preference for the English language due to their pre-training dataset composition. Comparisons with Transkribus show no consistent advantage for either approach. Moreover, LLMs demonstrate limited ability to autonomously correct errors in zero-shot transcriptions.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs for Handwritten Text Recognition accuracy.
Evaluating LLMs' ability to correct transcription errors autonomously.
Comparing proprietary and open-source LLMs in zero-shot settings.
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLMs recognize diverse handwriting without model-specific training.
Proprietary models outperform open-source alternatives in zero-shot settings.
LLMs show limited autonomous error correction capability.
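The self-correction setup evaluated here amounts to a second zero-shot prompting round: the model sees the image again together with its own first-pass transcription and is asked to revise it. The prompt wording and the `call_mllm` callable below are hypothetical placeholders, not the paper's actual prompts or API:

```python
# Sketch of a two-pass zero-shot transcription + self-correction loop.
# `call_mllm` stands in for any MLLM API that accepts an image plus a
# text prompt and returns text; the prompt wording is illustrative only.

TRANSCRIBE_PROMPT = "Transcribe the handwritten text in this image exactly."
CORRECT_PROMPT = (
    "Here is a transcription of the handwritten text in this image:\n\n"
    "{draft}\n\n"
    "Review it against the image and output a corrected transcription."
)

def transcribe_with_self_correction(image_bytes, call_mllm):
    # Pass 1: zero-shot transcription.
    draft = call_mllm(image_bytes, TRANSCRIBE_PROMPT)
    # Pass 2: ask the same model to correct its own output.
    revised = call_mllm(image_bytes, CORRECT_PROMPT.format(draft=draft))
    return draft, revised
```

Comparing CER before and after the second pass is what reveals the weak, inconsistent self-correction the paper reports: the revision step does not reliably lower the error rate.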
Giorgia Crosilla
University of Bologna, Bologna, Italy
Lukas Klic
I Tatti, The Harvard University Center for Italian Renaissance Studies, Florence, Italy
Giovanni Colavizza
University of Copenhagen and University of Bologna
Digital Humanities · Data Science · Artificial Intelligence