🤖 AI Summary
This study addresses “cultural anachronism” in vision-language models (VLMs): the erroneous application of temporally incongruent concepts when interpreting historical artifacts. The authors formally define and quantify this phenomenon and introduce TAB-VLM, the first temporally grounded benchmark for non-Western cultural heritage, comprising 1,600 Indian artifacts and 600 time-sensitive question-answer pairs. Leveraging extensive human-annotated data, they systematically evaluate ten state-of-the-art VLMs and find that even the best-performing model (GPT-5.2) achieves only 58.7% accuracy, revealing a widespread deficiency in temporal reasoning that does not substantially improve with model scale. This work fills a critical gap in evaluating VLMs’ temporal understanding within non-Western cultural contexts.
📝 Abstract
Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We term this phenomenon cultural anachronism: the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify it, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark: even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism is a substantial limitation of visual AI systems regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available on our project page.