🤖 AI Summary
Large language models (LLMs) exhibit pervasive cultural biases due to training data skewed toward high-resource languages, limiting their capacity to accurately represent multicultural contexts in low-resource language settings.
Method: We introduce MyCulture, the first Malay-language benchmark for evaluating LLMs on Malaysian multiculturalism, covering six domains: arts, attire, customs, entertainment, food, and religion. It employs open-ended multiple-choice questions (without predefined options), contrasts structured output with free-form generation to expose structural bias, and incorporates multilingual prompt variants to quantify language bias and cross-lingual consistency. A theoretical analysis justifies the effectiveness of the open-ended format.
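A minimal sketch of what scoring such option-free questions could look like; the function and field names below are illustrative assumptions, not the benchmark's actual harness:

```python
# Hypothetical sketch of an open-ended MCQ evaluation loop: the model sees
# the question WITHOUT answer options, and its free-form response is checked
# against the gold answer. Names here are illustrative, not MyCulture's API.
from dataclasses import dataclass

@dataclass
class Item:
    question: str   # e.g. a Bahasa Melayu question about Malaysian customs
    answer: str     # gold answer string

def normalize(text: str) -> str:
    """Lowercase and strip punctuation for lenient answer matching."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

def score_open_ended(items: list[Item], generate) -> float:
    """Accuracy when no options are shown: random guessing yields ~0,
    unlike the 1/k floor of a k-option multiple-choice format."""
    correct = 0
    for item in items:
        # No options in the prompt -- the model must produce the answer itself.
        prediction = generate(item.question)
        if normalize(item.answer) in normalize(prediction):
            correct += 1
    return correct / len(items)

# Usage with any text-generation callable, e.g. an API wrapper:
# acc = score_open_ended(dataset, lambda q: client.complete(q))
```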
Results: Experiments reveal significant performance disparities among leading regional and global LLMs on MyCulture, exposing systematic deficits in their understanding of low-resource cultural contexts. These findings underscore the urgent need for culturally embedded, linguistically inclusive evaluation frameworks.
📝 Abstract
Large Language Models (LLMs) often exhibit cultural biases due to training data dominated by high-resource languages like English and Chinese. This poses challenges for accurately representing and evaluating diverse cultural contexts, particularly in low-resource language settings. To address this, we introduce MyCulture, a benchmark designed to comprehensively evaluate LLMs on Malaysian culture across six pillars: arts, attire, customs, entertainment, food, and religion, presented in Bahasa Melayu. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options, thereby reducing guessing and mitigating format bias. We provide a theoretical justification for the effectiveness of this open-ended structure in improving both fairness and discriminative power. Furthermore, we analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations. Our evaluation across a range of regional and international LLMs reveals significant disparities in cultural comprehension, highlighting the urgent need for culturally grounded and linguistically inclusive benchmarks in the development and assessment of LLMs.
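To see why removing predefined options can improve both fairness and discriminative power, consider a back-of-the-envelope guessing model (an illustrative assumption, not the paper's actual theoretical justification):

```latex
% Let p be the fraction of items a model truly knows, and assume it guesses
% uniformly over k predefined options on the rest.
\[
  \mathbb{E}[\mathrm{acc}_{\mathrm{MCQ}}] = p + \frac{1-p}{k},
  \qquad
  \mathbb{E}[\mathrm{acc}_{\mathrm{open}}] = p .
\]
% The MCQ score has a guessing floor of 1/k, and the gap between two models
% with knowledge rates p_1 > p_2 is compressed by a factor (1 - 1/k):
\[
  \mathbb{E}[\mathrm{acc}_{\mathrm{MCQ}}^{(1)}] - \mathbb{E}[\mathrm{acc}_{\mathrm{MCQ}}^{(2)}]
  = (p_1 - p_2)\left(1 - \tfrac{1}{k}\right),
\]
% whereas the open-ended gap is the full p_1 - p_2, i.e. the format
% separates models by their actual knowledge with no guessing inflation.
```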