CultureSynth: A Hierarchical Taxonomy-Guided and Retrieval-Augmented Framework for Cultural Question-Answer Synthesis

📅 2025-09-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for cultural competence suffer from three limitations: fragmented taxonomies, narrow domain coverage, and heavy reliance on manual annotation. To address these, we propose CultureSynth, which pairs a hierarchical multilingual cultural taxonomy (12 primary and 130 secondary topics) with a retrieval-augmented generation (RAG) pipeline that synthesizes cultural question-answer pairs from retrieved factual knowledge with little human intervention. Based on this methodology, we release CultureSynth-7, a multilingual benchmark spanning 7 languages that comprises 19,360 synthetically generated entries and 4,149 manually verified entries. An evaluation of 14 large language models reveals clear performance stratification, model-specific architectural biases, and geographic disparities in cultural understanding. CultureSynth-7 thus provides infrastructure for rigorous cross-cultural evaluation of LLMs.

📝 Abstract
Cultural competence, defined as the ability to understand and adapt to multicultural contexts, is increasingly vital for large language models (LLMs) in global environments. While several cultural benchmarks exist to assess LLMs' cultural competence, current evaluations suffer from fragmented taxonomies, domain specificity, and heavy reliance on manual data annotation. To address these limitations, we introduce CultureSynth, a novel framework comprising (1) a comprehensive hierarchical multilingual cultural taxonomy covering 12 primary and 130 secondary topics, and (2) a Retrieval-Augmented Generation (RAG)-based methodology leveraging factual knowledge to synthesize culturally relevant question-answer pairs. The CultureSynth-7 synthetic benchmark contains 19,360 entries and 4,149 manually verified entries across 7 languages. Evaluation of 14 prevalent LLMs of different sizes reveals clear performance stratification led by ChatGPT-4o-Latest and Qwen2.5-72B-Instruct. The results demonstrate that a 3B-parameter threshold is necessary for achieving basic cultural competence, models display varying architectural biases in knowledge processing, and significant geographic disparities exist across models. We believe that CultureSynth offers a scalable framework for developing culturally aware AI systems while reducing reliance on manual annotation. The benchmark is available at https://github.com/Eyr3/CultureSynth.
Problem

Research questions and friction points this paper is trying to address.

Addresses fragmented cultural taxonomies in LLM evaluation
Reduces reliance on manual cultural data annotation
Synthesizes culturally relevant multilingual question-answer pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical multilingual cultural taxonomy
Retrieval-Augmented Generation methodology
Synthetic culturally relevant QA pairs
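
The two ideas listed above, a hierarchical taxonomy and retrieval-augmented synthesis, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the topic names, the seed queries, the toy corpus, the keyword-overlap retriever, and the prompt wording are all hypothetical. No LLM is called; the sketch stops at the grounded prompt that a generator model would receive.

```python
from dataclasses import dataclass

# Hypothetical two-level taxonomy: primary topic -> {secondary topic: seed query}.
# The paper's taxonomy has 12 primary and 130 secondary topics; these few
# entries are illustrative placeholders, not the authors' topics.
TAXONOMY = {
    "Food": {
        "Festive dishes": "festival eaten",
        "Table manners": "polite noodles",
    },
    "Arts": {
        "Traditional music": "musical instrument",
    },
}

# Toy in-memory knowledge base standing in for a real multilingual retriever.
CORPUS = [
    "Mooncakes are traditionally eaten during the Mid-Autumn Festival.",
    "The guqin is a seven-string Chinese musical instrument.",
    "In Japan, slurping noodles is generally considered polite.",
]


@dataclass
class SynthesisRequest:
    primary: str
    secondary: str
    language: str
    passages: list


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return {word.strip(".,;:!?").lower() for word in text.split()}


def retrieve(query, corpus, k=1):
    """Return the k passages sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_requests(taxonomy, corpus, language):
    """Walk the taxonomy and attach retrieved passages to every secondary topic."""
    requests = []
    for primary, secondaries in taxonomy.items():
        for secondary, seed_query in secondaries.items():
            passages = retrieve(seed_query, corpus)
            requests.append(SynthesisRequest(primary, secondary, language, passages))
    return requests


def build_prompt(request):
    """Format a grounded prompt that a generator LLM could turn into a QA pair."""
    facts = "\n".join(f"- {passage}" for passage in request.passages)
    return (
        f"Topic: {request.primary} > {request.secondary}\n"
        f"Language: {request.language}\n"
        f"Facts:\n{facts}\n"
        "Write one question and its answer, grounded only in the facts above."
    )


if __name__ == "__main__":
    for req in build_requests(TAXONOMY, CORPUS, "en"):
        print(build_prompt(req))
        print()
```

Walking the taxonomy guarantees that each secondary topic receives at least one request, which is one plausible way a taxonomy can steer coverage. A real pipeline would replace the keyword-overlap retriever with a proper multilingual one and would add the manual verification step that the abstract mentions.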
Xinyu Zhang
Tongyi Lab, Alibaba Group Inc
Pei Zhang
Tongyi Lab, Alibaba Group Inc
Shuang Luo
Tongyi Lab, Alibaba Group Inc
Jialong Tang
Qwen Team, Alibaba
LLM, NLP
Yu Wan
Tongyi Lab, Alibaba Group Inc
Baosong Yang
Alibaba-inc
Machine Learning, Large Language Model, Machine Translation
Fei Huang
Tongyi Lab, Alibaba Group Inc