Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models show limited ability to infer structured cultural metadata (such as creator, origin, and period) from images, and the field lacks a systematic benchmark for evaluating this ability. This work addresses that gap by introducing the first multi-category, cross-cultural image benchmark, together with fine-grained, attribute-level metrics that assess model performance across diverse cultural contexts and metadata types. Using a large language model as a judge (LLM-as-Judge), the authors conduct a multidimensional evaluation based on exact match, partial match, and attribute-level accuracy. Experiments show that existing models rely on fragmented visual cues, producing predictions that vary substantially across cultural backgrounds and attribute categories and that exhibit poor consistency and interpretability.
📝 Abstract
Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
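The three reported metrics (exact match, partial match, and attribute-level accuracy) can be illustrated with a minimal sketch. This is not the paper's implementation: the record format and attribute names are assumptions, and a simple normalized string comparison stands in for the LLM-as-Judge semantic-alignment check.

```python
from typing import Dict, List

# Assumed record format: one dict of metadata attributes per image.
ATTRIBUTES = ["creator", "origin", "period"]

def evaluate(preds: List[Dict[str, str]], golds: List[Dict[str, str]]) -> dict:
    """Score predictions with exact-match, partial-match, and
    per-attribute accuracy. String equality stands in for the
    paper's LLM-as-Judge semantic comparison."""
    exact = partial = 0
    attr_hits = {a: 0 for a in ATTRIBUTES}
    for pred, gold in zip(preds, golds):
        # Attributes judged correct for this image.
        hits = [a for a in ATTRIBUTES
                if pred.get(a, "").strip().lower() == gold[a].strip().lower()]
        for a in hits:
            attr_hits[a] += 1
        if len(hits) == len(ATTRIBUTES):
            exact += 1    # every attribute matched
        elif hits:
            partial += 1  # at least one, but not all, attributes matched
    n = len(golds)
    return {
        "exact_match": exact / n,
        "partial_match": partial / n,
        "attribute_accuracy": {a: attr_hits[a] / n for a in ATTRIBUTES},
    }
```

For example, if one of two images has all three attributes correct and the other has only origin and period correct, this yields 0.5 exact match, 0.5 partial match, and per-attribute accuracies of 0.5 (creator), 1.0 (origin), and 1.0 (period).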
Problem

Research questions and friction points this paper addresses:

- structured cultural metadata
- cross-cultural inference
- vision-language models
- cultural heritage
- image understanding
Innovation

Methods, ideas, or system contributions that make the work stand out:

- structured cultural metadata
- cross-cultural benchmark
- vision-language models
- LLM-as-Judge
- semantic alignment
🔎 Similar Papers

- Yuechen Jiang (University of Hawaii at Manoa)
- Enze Zhang (School of Artificial Intelligence, Wuhan University)
- Md Mohsinul Kabir (University of Manchester)
- Qianqian Xie (Wuhan University)
- Stavroula Golfomitsou (Getty Conservation Institute)
- Konstantinos Arvanitis (University of Manchester)
- Sophia Ananiadou (Professor, Computer Science, Manchester University, National Centre for Text Mining)