AI Summary
This work addresses two limitations of existing methods for evaluating cultural awareness: they rely on costly human annotations, and they lack fine-grained, interpretable mechanisms for free-form text. To bridge this gap, the authors propose ExCAM, to their knowledge the first dedicated metric that identifies, rates, and explains cultural errors in free text. They construct ExCAM40k, a dataset built from nine existing benchmarks that are reformatted and enhanced with synthetic errors, and use it to train and evaluate ExCAM. On a balanced test set, ExCAM reaches up to 80% accuracy in cultural error detection, the highest detection rate among the compared baselines, which include GPT-5. ExCAM thus offers an interpretable approach to evaluating cultural awareness in natural language text.
Abstract
Evaluating the cultural awareness of large language models is crucial to ensure the fairness of generated text and the generalizability of applications across the world. Recent benchmarks explore cultural goods, such as food, or values, such as behavior in stressful situations, through the lens of question answering or text generation tasks. However, creating these benchmarks requires time-intensive and costly human annotations. Moreover, benchmarks that evaluate cultural awareness in free text are scarce and often rely on dated evaluation mechanisms. To address this gap, we introduce ExCAM, an Explainable Cultural Awareness Metric, which is, to our knowledge, the first dedicated evaluation metric that identifies, rates, and explains cultural errors in instruction-output pairs. To train and evaluate ExCAM, we introduce ExCAM40k, a dataset composed of nine existing benchmarks that we reformat and enhance with synthetic errors. Compared to several baselines, including GPT-5, ExCAM achieves the highest error detection rate, with up to 80% accuracy on a balanced test set. ExCAM thus opens a path toward fine-grained and explainable cultural evaluation of free text.