Tears or Cheers? Benchmarking LLMs via Culturally Elicited Distinct Affective Responses

📅 2026-01-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitation of current cultural alignment evaluations for large language models, which predominantly focus on declarative knowledge and fail to capture cross-cultural differences in affective interpretation. To bridge this gap, the authors propose CEDAR, a multimodal benchmark that uniquely targets culture-induced fine-grained emotional responses. They construct a dataset comprising 10,962 samples across seven languages and fourteen emotion categories, leveraging a novel pipeline that integrates generative pre-annotation by large language models with multilingual human validation. Through systematic evaluation of seventeen prominent multilingual models, the study reveals a significant disconnect between linguistic consistency and culturally aligned emotional understanding, thereby challenging conventional assessment paradigms in cross-cultural AI alignment.
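The construction pipeline described above, generative pre-annotation followed by multilingual human validation, can be pictured with a minimal sketch. Everything below is an illustrative assumption: the culture codes, the `llm_label` callable, and the keep-if-labels-diverge rule are hypothetical stand-ins, since the summary does not specify the pipeline's actual implementation.

```python
from typing import Callable

# Assumed ISO codes for the seven benchmark languages; the summary does not
# list which languages are covered, so these are placeholders.
CULTURES = ["en", "zh", "ja", "ar", "es", "hi", "de"]

def divergent_candidates(
    scenarios: list[str],
    llm_label: Callable[[str, str], str],
) -> list[dict]:
    """Pre-annotation filter: keep only scenarios whose provisional emotion
    labels differ between at least two cultures. The survivors are the
    candidates passed on to multilingual human validation."""
    kept = []
    for scenario in scenarios:
        labels = {c: llm_label(scenario, c) for c in CULTURES}
        if len(set(labels.values())) > 1:  # cross-cultural disagreement
            kept.append({"scenario": scenario, "provisional": labels})
    return kept

# Toy stub standing in for a culture-conditioned LLM call that returns one
# of the fourteen fine-grained emotion categories.
def toy_label(scenario: str, culture: str) -> str:
    return "shame" if (culture == "ja" and "gift" in scenario) else "joy"

print(divergent_candidates(["opening a gift in public", "a sunny day"], toy_label))
# -> only the gift scenario survives; cultures assign it different emotions
```

The design point this sketch captures is the filter itself: only scenarios whose provisional labels disagree across cultures reach human annotators, which is what concentrates the benchmark on culture-induced affective differences rather than universally interpreted stimuli.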

📝 Abstract
Culture serves as a fundamental determinant of human affective processing and profoundly shapes how individuals perceive and interpret emotional stimuli. Despite this intrinsic link, existing evaluations of cultural alignment in Large Language Models primarily prioritize declarative knowledge, such as geographical facts or established societal customs. These benchmarks remain insufficient to capture the subjective interpretative variance inherent to diverse sociocultural lenses. To address this limitation, we introduce CEDAR, a multimodal benchmark constructed entirely from scenarios capturing Culturally Elicited Distinct Affective Responses. To construct CEDAR, we implement a novel pipeline that leverages LLM-generated provisional labels to isolate instances yielding cross-cultural emotional distinctions, and subsequently derives reliable ground-truth annotations through rigorous human evaluation. The resulting benchmark comprises 10,962 instances across seven languages and fourteen fine-grained emotion categories, with each language including 400 multimodal and 1,166 text-only samples. Comprehensive evaluations of 17 representative multilingual models reveal a dissociation between language consistency and cultural alignment, demonstrating that culturally grounded affective understanding remains a significant challenge for current models.
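The abstract's central finding, a dissociation between language consistency and cultural alignment, contrasts two different scores. As a hedged illustration (these metric definitions are assumptions for exposition, not necessarily the paper's exact formulas), language consistency asks whether a model answers the same scenario identically across languages, while cultural alignment asks whether each answer matches that culture's human-validated label:

```python
from collections import Counter

def language_consistency(preds_by_lang: dict[str, str]) -> float:
    """Fraction of languages agreeing with the model's majority answer
    for one scenario translated into every language."""
    counts = Counter(preds_by_lang.values())
    return counts.most_common(1)[0][1] / len(preds_by_lang)

def cultural_alignment(preds_by_lang: dict[str, str],
                       gold_by_lang: dict[str, str]) -> float:
    """Fraction of languages where the prediction matches that culture's
    human-validated emotion label."""
    hits = sum(preds_by_lang[l] == gold_by_lang[l] for l in preds_by_lang)
    return hits / len(preds_by_lang)

# A model can be perfectly self-consistent yet culturally misaligned:
preds = {"en": "joy", "ja": "joy", "ar": "joy"}
gold  = {"en": "joy", "ja": "shame", "ar": "shame"}
assert language_consistency(preds) == 1.0
assert cultural_alignment(preds, gold) == 1 / 3
```

The toy example makes the dissociation concrete: identical answers in every language yield perfect consistency, yet the model aligns with only one of the three culture-specific ground truths.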
Problem

Research questions and friction points this paper is trying to address.

cultural alignment
affective responses
cross-cultural emotion
subjective interpretation
multilingual models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Culturally Elicited Affective Responses
Multimodal Benchmark
Cross-cultural Emotion Recognition
LLM Cultural Alignment
Human Evaluation Pipeline