AI Summary
Traditional digitization of historical documents has largely been confined to character-level transcription, producing text that lacks the structural and semantic information needed for in-depth analysis. This work proposes VERITAS, a framework that reconceptualizes digitization as an integrated pipeline combining transcription, layout analysis, and semantic enrichment. Its four stages (preprocessing, extraction, refinement, and enrichment) transform document images into structured knowledge end to end. VERITAS is modular and model-agnostic, and its schema-driven architecture lets researchers specify extraction targets declaratively. Evaluated on the critical edition of Bernardino Corio's Storia di Milano, a Renaissance chronicle of over 1,600 pages, the framework achieves a 67.6% relative reduction in word error rate compared to a commercial OCR baseline and a threefold reduction in end-to-end processing time once manual correction is accounted for. Querying the transcribed corpus through a retrieval-augmented generation system further illustrates that the output can support downstream historical question answering.
Abstract
The digitisation of historical documents has traditionally been conceived as a process limited to character-level transcription, producing flat text that lacks the structural and semantic information necessary for substantive computational analysis. We present VERITAS (Vision-Enhanced Reading, Interpretation, and Transcription of Archival Sources), a modular, model-agnostic framework that reconceptualises digitisation as an integrated workflow encompassing transcription, layout analysis, and semantic enrichment. The pipeline is organised into four stages - Preprocessing, Extraction, Refinement, and Enrichment - and employs a schema-driven architecture that allows researchers to declaratively specify their extraction objectives. We evaluate VERITAS on the critical edition of Bernardino Corio's Storia di Milano, a Renaissance chronicle of over 1,600 pages. Results demonstrate that the pipeline achieves a 67.6% relative reduction in word error rate compared to a commercial OCR baseline, with a threefold reduction in end-to-end processing time when accounting for manual correction. We further illustrate the downstream utility of the pipeline's output by querying the transcribed corpus through a retrieval-augmented generation system, demonstrating its capacity to support historical inquiry.
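For readers unfamiliar with the headline metric, the following is a minimal sketch of how a word error rate and a relative reduction between two systems are commonly computed. It is not taken from the VERITAS codebase: the function names `word_error_rate` and `relative_reduction` are hypothetical, the tokenisation is a plain whitespace split, and the example strings are toy inputs rather than text or results from the evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; a real evaluation
    would likely normalise punctuation and casing first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the `baseline` error rate."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy strings, purely illustrative.
    reference = "the duke entered the city"
    baseline_output = "the duke entred teh city"
    pipeline_output = "the duke entered teh city"

    wer_baseline = word_error_rate(reference, baseline_output)
    wer_pipeline = word_error_rate(reference, pipeline_output)
    print(f"baseline WER: {wer_baseline:.2f}")
    print(f"pipeline WER: {wer_pipeline:.2f}")
    print(f"relative reduction: {relative_reduction(wer_baseline, wer_pipeline):.1%}")
```

Under this definition, a relative reduction describes the share of the baseline's errors that is removed, not an absolute difference in percentage points between the two error rates.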