Quid est VERITAS? A Modular Framework for Archival Document Analysis

📅 2026-03-30
🤖 AI Summary
Traditional digitization of historical documents has largely been confined to character-level transcription, which yields flat text without the structural and semantic information needed for in-depth analysis. This work proposes VERITAS, a framework that reconceptualizes digitization as an integrated pipeline combining transcription, layout analysis, and semantic enrichment. Its four stages (preprocessing, extraction, refinement, and enrichment) take document images through to structured, enriched output. VERITAS is modular and model-agnostic, and its schema-driven architecture lets researchers declare their extraction objectives. Evaluated on the critical edition of Bernardino Corio's Storia di Milano, a Renaissance chronicle of over 1,600 pages, the pipeline achieves a 67.6% relative reduction in word error rate compared to a commercial OCR baseline and a threefold reduction in end-to-end processing time once manual correction is accounted for. Querying the transcribed corpus through a retrieval-augmented generation system illustrates how the output can support historical question answering.
📝 Abstract
The digitisation of historical documents has traditionally been conceived as a process limited to character-level transcription, producing flat text that lacks the structural and semantic information necessary for substantive computational analysis. We present VERITAS (Vision-Enhanced Reading, Interpretation, and Transcription of Archival Sources), a modular, model-agnostic framework that reconceptualises digitisation as an integrated workflow encompassing transcription, layout analysis, and semantic enrichment. The pipeline is organised into four stages - Preprocessing, Extraction, Refinement, and Enrichment - and employs a schema-driven architecture that allows researchers to declaratively specify their extraction objectives. We evaluate VERITAS on the critical edition of Bernardino Corio's Storia di Milano, a Renaissance chronicle of over 1,600 pages. Results demonstrate that the pipeline achieves a 67.6% relative reduction in word error rate compared to a commercial OCR baseline, with a threefold reduction in end-to-end processing time when accounting for manual correction. We further illustrate the downstream utility of the pipeline's output by querying the transcribed corpus through a retrieval-augmented generation system, demonstrating its capacity to support historical inquiry.
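The headline metric above is a relative reduction in word error rate (WER). As a minimal sketch of how such a figure is typically computed, assuming the standard definition of WER as word-level edit distance divided by the number of reference words, the snippet below may help. The function names and the sample strings are hypothetical illustrations, not the authors' evaluation code or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumes whitespace tokenisation; a real evaluation may normalise
    punctuation, case, or historical spelling first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, pipeline_wer: float) -> float:
    """Fraction by which the pipeline lowers WER relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - pipeline_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical strings, invented purely for illustration.
    truth = "the duke entered the city at dawn"
    baseline_output = "the duke entred the citv at dawn"
    pipeline_output = "the duke entered the citv at dawn"

    baseline = word_error_rate(truth, baseline_output)
    pipeline = word_error_rate(truth, pipeline_output)
    print(f"baseline WER: {baseline:.3f}")
    print(f"pipeline WER: {pipeline:.3f}")
    print(f"relative reduction: {relative_reduction(baseline, pipeline):.1%}")
```

Under this definition, a 67.6% relative reduction means the pipeline's WER is 32.4% of the baseline's WER. It does not say what either absolute value is.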
Problem

Research questions and friction points this paper is trying to address.

historical document digitisation
structural information
semantic enrichment
computational analysis
OCR limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

modular framework
schema-driven architecture
layout analysis
semantic enrichment
retrieval-augmented generation
Leonardo Bassanini
Università degli Studi di Milano, University Library Service, Via Santa Sofia, 7/9 - 20122 Milano (Italy)
Ludovico Biancardi
Università degli Studi di Milano, University Library Service, Via Santa Sofia, 7/9 - 20122 Milano (Italy)
Alfio Ferrara
Dipartimento di Informatica, Università degli Studi di Milano
data science, natural language processing, digital humanities
Andrea Gamberini
Università degli Studi di Milano, Department of Historical Studies, Via Festa del Perdono, 7 - 20122 Milano (Italy)
Sergio Picascia
Università degli Studi di Milano, Department of Computer Science, Via Celoria, 18 - 20133 Milano (Italy)
Folco Vaglienti
Università degli Studi di Milano, Department of Historical Studies, Via Festa del Perdono, 7 - 20122 Milano (Italy)