🤖 AI Summary
Generative models exhibit significant performance degradation in multilingual and cross-cultural settings, primarily due to the scarcity of high-quality, culturally salient, and globally representative multilingual data resources.
Method: We propose a reproducible and scalable multi-path data construction framework that integrates multilingual web harvesting, automated cultural salience filtering, and community-driven data contribution. This framework systematically expands culturally aware training and evaluation datasets. Technically, we design novel cultural bias detection metrics and a multidimensional evaluation protocol—enabling, for the first time, quantitative assessment of generative models’ cross-cultural applicability.
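The paper does not specify the implementation of the pipeline, but the three collection paths and the salience-filtering step it names can be sketched roughly as below. All names (`Sample`, `filter_by_salience`, `merge_paths`) and the idea of a precomputed per-sample salience score are illustrative assumptions, not the authors' actual design:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    text: str
    language: str
    source: str     # collection path, e.g. "web" or "community"
    salience: float # hypothetical cultural-salience score in [0, 1]

def merge_paths(*paths):
    """Combine samples from multiple collection paths, deduplicating by text."""
    seen, merged = set(), []
    for path in paths:
        for s in path:
            if s.text not in seen:
                seen.add(s.text)
                merged.append(s)
    return merged

def filter_by_salience(samples, threshold=0.5):
    """Keep only samples whose cultural-salience score clears the threshold."""
    return [s for s in samples if s.salience >= threshold]

# Toy usage: two collection paths feeding one filtered dataset.
web = [Sample("Diwali sweets recipe", "hi", "web", 0.9),
       Sample("Generic small talk", "hi", "web", 0.1)]
community = [Sample("Hanbok etiquette notes", "ko", "community", 0.8)]
dataset = filter_by_salience(merge_paths(web, community))
```

In practice the salience score would come from an automated classifier (the "automated cultural salience filtering" the summary mentions); here it is simply attached to each sample to keep the sketch self-contained.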
Results: Experiments demonstrate that our framework effectively uncovers cultural biases across critical dimensions—including value expression, regional commonsense knowledge, and social norms. It establishes an extensible data infrastructure and standardized benchmark for fairness-aware model optimization and global deployment.
📝 Abstract
Generative models are known to exhibit reduced performance across global cultural contexts and languages. While continual data updates are commonly used to improve overall model performance, bolstering and evaluating the cross-cultural competence of generative AI models requires data resources to be intentionally expanded to include global contexts and languages. In this work, we construct a repeatable, scalable, multi-pronged pipeline to collect and contribute culturally salient, multilingual data. We posit that such data can be used to assess the global applicability of our models and, in turn, help identify and narrow cross-cultural gaps.