Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench

📅 2025-04-01

🤖 AI Summary
This work investigates an underexplored capability of large language models (LLMs): understanding and reasoning about implicit cultural values across ethical, religious, social, and political topics. The paper introduces CQ-Bench, a benchmark built from multi-character natural conversations, with three tasks of increasing complexity: attitude detection, value selection, and value extraction. Stories are generated from values in the World Values Survey and GlobalOpinions datasets and then validated with GPT-4o through three checks (incorporation, consistency, and implicitness), reaching 98.2% human-model agreement in the final validation. Fine-tuning LLaMA-3.2-3B on only 500 culturally rich examples improves performance by over 10% and outperforms o3-mini in some cases. o1 and DeepSeek-R1 reach human-level performance on value selection (0.809 and 0.814) but fall short on nuanced attitude detection (F1 of 0.622 and 0.635), and the value extraction scores of GPT-4o-mini and o3-mini (0.602 and 0.598) show how difficult open-ended cultural reasoning remains.

📝 Abstract
Cultural Intelligence (CQ) refers to the ability to understand unfamiliar cultural contexts, a crucial skill for large language models (LLMs) to effectively engage with globally diverse users. While existing research often focuses on explicitly stated cultural norms, such approaches fail to capture the subtle, implicit values that underlie real-world conversations. To address this gap, we introduce CQ-Bench, a benchmark specifically designed to assess LLMs' capability to infer implicit cultural values from natural conversational contexts. We generate a dataset of multi-character, conversation-based stories using values from the World Values Survey and GlobalOpinions datasets, covering ethical, religious, social, and political topics. Our dataset construction pipeline includes rigorous validation procedures (incorporation, consistency, and implicitness checks) using GPT-4o, with 98.2% human-model agreement in the final validation. Our benchmark consists of three tasks of increasing complexity: attitude detection, value selection, and value extraction. We find that while the o1 and DeepSeek-R1 models reach human-level performance in value selection (0.809 and 0.814), they still fall short in nuanced attitude detection, with F1 scores of 0.622 and 0.635, respectively. In the value extraction task, GPT-4o-mini and o3-mini score 0.602 and 0.598, highlighting the difficulty of open-ended cultural reasoning. Notably, fine-tuning smaller models (e.g., LLaMA-3.2-3B) on only 500 culturally rich examples improves performance by over 10%, even outperforming stronger baselines (o3-mini) in some cases. Using CQ-Bench, we provide insights into the current challenges in LLMs' CQ research and suggest practical pathways for enhancing LLMs' cross-cultural reasoning abilities.
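To make the reported scores more concrete, here is a minimal sketch of how a set-based F1 could be computed for a single value-selection item. The function name `set_f1`, the example value labels, and the assumption that the task is scored as a multi-label selection are illustrative and not taken from the paper.

```python
def set_f1(predicted, gold):
    """F1 between predicted and gold sets of value labels (hypothetical scoring)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # avoids division by zero below
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical item: two gold values, the model recovers one and adds a wrong one.
gold = {"respect for elders", "religious tolerance"}
predicted = {"respect for elders", "gender equality"}
score = set_f1(predicted, gold)  # precision 0.5, recall 0.5
```

A benchmark-level score would then be an average of such per-item values; how CQ-Bench aggregates them is not stated in the abstract.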
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to understand implicit cultural values
Evaluating performance in nuanced cultural attitude detection
Improving cross-cultural reasoning through targeted fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

CQ-Bench assesses LLMs' implicit cultural understanding
Multi-character conversation dataset built from World Values Survey and GlobalOpinions values
Fine-tuning LLaMA-3.2-3B on 500 examples improves cultural reasoning performance by over 10%
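
The validation procedure described in the abstract (incorporation, consistency, and implicitness checks) can be pictured as a sequential filter over generated stories. The sketch below is only an illustration: the paper uses GPT-4o as the judge, whereas `toy_judge`, the story fields, and the substring rule are hypothetical stand-ins.

```python
from typing import Callable, Dict, List

# Check names follow the abstract; their concrete definitions are assumptions.
CHECKS = ("incorporation", "consistency", "implicitness")


def filter_stories(
    stories: List[Dict[str, str]],
    judge: Callable[[str, Dict[str, str]], bool],
) -> List[Dict[str, str]]:
    """Keep only the stories that pass every check."""
    return [s for s in stories if all(judge(check, s) for check in CHECKS)]


def toy_judge(check: str, story: Dict[str, str]) -> bool:
    """Stand-in for an LLM judge: only 'implicitness' is actually tested."""
    if check == "implicitness":
        # Reject stories that state the target value word for word.
        return story["value"].lower() not in story["text"].lower()
    return True


stories = [
    {"text": "She waited for her grandfather to sit first.",
     "value": "respect for elders"},
    {"text": "He said respect for elders matters most.",
     "value": "respect for elders"},
]
kept = filter_stories(stories, toy_judge)  # the second story states the value explicitly
```

Passing the judge as a parameter keeps the filter independent of any particular model API, so the stub can be swapped for a real judge call.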