🤖 AI Summary
This work addresses the long-standing challenge of systematically localizing, extracting, and semantically analyzing illustrations in large-scale collections of historical illuminated manuscripts. We propose a general and scalable deep learning pipeline that integrates modern computer vision with vision-language models to enable page-level illustration detection, precise extraction, and multimodal embedding. By aligning visual and textual representations in a shared latent space, our approach supports cross-corpus visual retrieval, clustering, and art-historical style analysis. Experiments on heterogeneous manuscript collections, including collections from the Vatican Library and the Bible of Borso d'Este, demonstrate the method's applicability and uncover visual patterns and relationships across manuscripts. This framework advances digital humanities research by offering a data-driven approach to the comparative study of historical visual culture.
📝 Abstract
The recent Artificial Intelligence (AI) revolution has opened transformative possibilities for the humanities, particularly in unlocking the visual-artistic content embedded in historical illuminated manuscripts. While digital archives now offer unprecedented access to these materials, the ability to systematically locate, extract, and analyze illustrations at scale remains a major challenge. We present a general and scalable AI-based pipeline for large-scale visual analysis of illuminated manuscripts. The framework integrates modern deep-learning models for page-level illustration detection, illustration extraction, and multimodal description, enabling scholars to search, cluster, and study visual materials and artistic trends across entire corpora. We demonstrate the applicability of this approach on large heterogeneous collections, including the Vatican Library and richly illuminated manuscripts such as the Bible of Borso d'Este. The system reveals meaningful visual patterns and cross-manuscript relationships by embedding illustrations into a shared representation space and analyzing their similarity structure (see Figure 4). By harnessing recent advances in computer vision and vision-language models, our framework enables new forms of large-scale visual scholarship in historical studies, art history, and cultural heritage, making it possible to explore iconography, stylistic trends, and cultural connections in ways that were previously impractical.
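To make the idea of "analyzing similarity structure" in a shared representation space concrete, the following is a minimal sketch, not the pipeline's actual implementation. It assumes each extracted illustration has already been mapped to a fixed-length embedding vector by some vision-language model, and it uses cosine similarity as the comparison measure, which is a common choice but is not confirmed by the abstract. The identifiers (`ms_a_fol_12r` and so on), the three-dimensional toy vectors, and the function names are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def top_k_similar(query, embeddings, k=3):
    """Return the k (identifier, score) pairs most similar to the query."""
    scored = [
        (name, cosine_similarity(query, vec))
        for name, vec in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real ones would come from a vision-language
# model and have hundreds of dimensions.
toy_embeddings = {
    "ms_a_fol_12r": [0.9, 0.1, 0.0],
    "ms_b_fol_03v": [0.8, 0.2, 0.1],
    "ms_c_fol_40r": [0.0, 0.1, 0.9],
}

results = top_k_similar([1.0, 0.0, 0.0], toy_embeddings, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In this toy setup the two illustrations from different manuscripts that point in nearly the same direction rank highest, which is the kind of cross-manuscript relationship a retrieval or clustering step would surface. At corpus scale, an approximate nearest-neighbor index would likely replace the exhaustive comparison shown here.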