CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition

📅 2025-09-24

🤖 AI Summary

Existing vision-language models (VLMs) struggle with historical documents, which combine multilingual code-mixing, non-standard orthography, complex layouts, and severe image degradation. These are key bottlenecks in cultural heritage digitization. To address this, we introduce CHURRO, an open-weight, 3-billion-parameter VLM specialized for historical text recognition. We further release CHURRO-DS, the largest historical text recognition dataset to date, which unifies 155 corpora (99,491 pages) spanning 22 centuries and 46 language clusters. CHURRO employs end-to-end supervised training with joint multilingual text alignment and integrated printed/handwritten script modeling. On the CHURRO-DS test set, it achieves normalized Levenshtein similarity of 82.3% on printed text and 70.1% on handwritten text, surpassing the second-best model, Gemini 2.5 Pro, by 1.4% and 6.5%, respectively, while being 15.5 times more cost-effective.

📝 Abstract

Accurate text recognition for historical documents can greatly advance the study and preservation of cultural heritage. Existing vision-language models (VLMs), however, are designed for modern, standardized texts and are not equipped to read the diverse languages and scripts, irregular layouts, and frequent degradation found in historical materials. This paper presents CHURRO, a 3B-parameter open-weight VLM specialized for historical text recognition. The model is trained on CHURRO-DS, the largest historical text recognition dataset to date. CHURRO-DS unifies 155 historical corpora comprising 99,491 pages, spanning 22 centuries of textual heritage across 46 language clusters, including historical variants and dead languages. We evaluate several open-weight and closed VLMs and optical character recognition (OCR) systems on CHURRO-DS and find that CHURRO outperforms all other VLMs. On the CHURRO-DS test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, surpassing the second-best model, Gemini 2.5 Pro, by 1.4% and 6.5%, respectively, while being 15.5 times more cost-effective. By releasing the model and dataset, we aim to enable community-driven research to improve the readability of historical texts and accelerate scholarship.
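The headline metric, normalized Levenshtein similarity, can be illustrated with a short Python sketch. The abstract does not spell out the exact normalization, so this assumes one common definition: one minus the character-level edit distance divided by the length of the longer string. The function names `levenshtein` and `normalized_similarity` are hypothetical and are not taken from the CHURRO code.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute; cost 1 each)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the DP row short
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_similarity(pred: str, ref: str) -> float:
    """Assumed normalization: 1 - distance / max(len(pred), len(ref))."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(pred, ref) / longest


if __name__ == "__main__":
    # 3 edits over a longer string of 7 characters
    print(round(normalized_similarity("kitten", "sitting"), 3))  # 0.571
```

Under this definition a score of 1.0 means the transcription matches the reference exactly, and a score of 0.0 means every character position had to be edited. A reported score such as 82.3% would correspond to an average over test pages, although how the paper aggregates per-page scores is not stated in the abstract.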

Problem

Research questions and friction points this paper is trying to address.

Recognizing diverse historical texts with irregular layouts and degradation

Overcoming limitations of existing models designed for modern standardized texts

Improving accuracy and cost-effectiveness for historical document recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-weight 3B-parameter VLM for historical text

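Trained on CHURRO-DS, the largest historical text recognition dataset to date (155 corpora, 99,491 pages)

Surpasses the second-best model, Gemini 2.5 Pro, by 1.4% (printed) and 6.5% (handwritten) normalized Levenshtein similarity while being 15.5 times more cost-effective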
