🤖 AI Summary
This study addresses cross-modal understanding of museum artifacts: aligning visual features extracted from cultural relic images with historical knowledge to support interactive visitor question-answering and deep historical reasoning. To this end, we construct the first large-scale museum image-text dataset—comprising 65 million images and 200 million high-quality QA pairs—and establish the first expert-annotated, full-scenario museum Visual Question Answering (VQA) benchmark. We propose five fine-grained evaluation tasks to systematically assess two representative multimodal models: BLIP (for vision-language alignment) and LLaVA (an instruction-tuned vision-language large language model). Experimental results demonstrate that LLaVA significantly outperforms BLIP on knowledge-intensive questions. All data, benchmarks, and code are publicly released to advance AI for cultural heritage—from perceptual recognition toward knowledge-grounded reasoning—and to establish a new paradigm for museum-oriented multimodal understanding.
📝 Abstract
Museums serve as vital repositories of cultural heritage and historical artifacts spanning diverse epochs, civilizations, and regions, with well-documented collections that record key attributes such as age, origin, material, and cultural significance. Understanding museum exhibits from their images requires reasoning beyond visual features. In this work, we facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs in the standard museum catalog format for exhibits from around the world; (b) training large vision-language models on the collected dataset; and (c) benchmarking their ability on five visual question answering tasks. The complete dataset is labeled by museum experts, ensuring both the quality and the practical significance of the labels. We train two VLMs from different categories: the BLIP model, which has vision-language aligned embeddings but lacks the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through exhaustive experiments, we provide several insights into the complex, fine-grained understanding of museum exhibits. In particular, we show that questions whose answers can be derived directly from visual features are answered well by both types of models, whereas questions that require grounding the visual features in repositories of human knowledge are better answered by the large vision-language models, demonstrating their superior capacity for the desired reasoning. Find our dataset, benchmarks, and source code at: https://github.com/insait-institute/Museum-65
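The abstract describes benchmarking the two models on five VQA tasks. As a rough illustration of how per-task scoring could work, here is a minimal sketch of exact-match accuracy grouped by task; the function names, the normalization rule, and the use of exact match are illustrative assumptions, not the paper's actual evaluation protocol:

```python
from collections import defaultdict

def normalize(answer: str) -> str:
    # Illustrative normalization: lowercase and drop punctuation so that
    # "Bronze." and "bronze" compare equal. The paper's own protocol may differ.
    return "".join(c for c in answer.lower() if c.isalnum() or c.isspace()).strip()

def per_task_accuracy(records):
    """records: iterable of (task, predicted_answer, gold_answer) triples.
    Returns {task: exact-match accuracy}, one score per VQA task."""
    hits, totals = defaultdict(int), defaultdict(int)
    for task, pred, gold in records:
        totals[task] += 1
        hits[task] += int(normalize(pred) == normalize(gold))
    return {task: hits[task] / totals[task] for task in totals}

# Hypothetical predictions for two of the fine-grained tasks:
records = [
    ("material", "Bronze.", "bronze"),  # visually derivable; matches after normalization
    ("era", "Tang dynasty", "Ming dynasty"),  # knowledge-grounded; mismatch
]
print(per_task_accuracy(records))  # → {'material': 1.0, 'era': 0.0}
```

Grouping accuracy by task rather than reporting a single aggregate number is what lets the comparison in the abstract surface: both model types can score well on visually derivable attributes while diverging on knowledge-intensive ones.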