🤖 AI Summary
This study addresses the lack of systematic evaluation of cultural understanding in current vision-language models in the context of cross-cultural art criticism. It proposes the first three-tiered assessment framework for this task, integrating automated metrics, rubric-based expert scoring, and isotonic regression calibration, and introduces a single-primary-reviewer mechanism to improve scoring consistency. Evaluated on a test set of 152 samples anchored by 294 expert judgments spanning six major cultural traditions, the approach reduces mean absolute error by 5.2%, uncovering model biases favoring Western art and highlighting the unreliability of multi-reviewer scoring. The framework establishes a reproducible benchmark for evaluating cross-cultural vision-language understanding.
📝 Abstract
Vision-Language Models (VLMs) excel at visual perception, yet their ability to interpret cultural meaning in art remains under-validated: cultural understanding and interpretability are often overlooked when evaluating these models. To address this gap, this paper introduces a tri-tier evaluation framework for cross-cultural art-critique assessment. Tier I provides a series of automated metrics indicating cultural coverage. Tier II applies theory-informed, template-based scoring by a single primary judge across five evaluation dimensions (Coverage, Alignment, Depth, Accuracy, Quality), each rated on a 1--5 scale. Tier III then calibrates the aggregated Tier II scores via isotonic regression. The proposed framework is validated in a large-scale experiment covering 15 VLMs on 294 art-critique evaluation pairs spanning six cultural traditions. Our findings reveal that (i) automated metrics are unreliable proxies for cultural depth, (ii) Western samples score higher than non-Western samples under our sampling and evaluation template, indicating potential model bias, and (iii) VLMs exhibit a consistent performance gap, performing well in visual description but underperforming in cultural interpretation. Dataset and code are available at https://github.com/yha9806/VULCA-Framework.
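Tier III's calibration step fits a monotone (non-decreasing) map from aggregated rubric scores to expert-judged targets, which is what isotonic regression does. Below is a minimal sketch of the pool-adjacent-violators (PAV) algorithm that underlies isotonic regression; the function name and toy data are illustrative assumptions, not taken from the paper or its released code.

```python
def pav_calibrate(scores, targets):
    """Fit a non-decreasing mapping from model scores to expert targets
    using pool-adjacent-violators (isotonic regression, L2 loss).
    Returns the calibrated value for each input score, in input order."""
    # Sort indices by the raw score so monotonicity is defined over score order.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    y = [targets[i] for i in order]

    # Each block holds [mean value, weight]; merge adjacent blocks
    # whenever they violate the non-decreasing constraint.
    blocks = []
    for val in y:
        blocks.append([float(val), 1.0])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2 = blocks.pop()
            v1, w1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])

    # Expand pooled blocks back to one fitted value per sample.
    fitted = []
    for val, wt in blocks:
        fitted.extend([val] * int(wt))

    # Restore the original sample order.
    out = [0.0] * len(scores)
    for pos, i in enumerate(order):
        out[i] = fitted[pos]
    return out


# Toy example: the judge over-rates sample 2 relative to sample 3,
# and PAV pools the violating pair to their average.
calibrated = pav_calibrate([1, 2, 3, 4], [1.0, 3.0, 2.0, 4.0])
# → [1.0, 2.5, 2.5, 4.0]
```

In practice one would fit such a map on held-out (score, expert-judgment) pairs and apply it to new scores; `sklearn.isotonic.IsotonicRegression` provides an equivalent, production-ready implementation.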