🤖 AI Summary
This study addresses the systematic gaps in the knowledge that generative large language models (LLMs) have of the world's diverse cultures. We empirically evaluate two retrieval-augmented generation (RAG) strategies: knowledge-base (KB) grounding, which retrieves from curated, domain-specific cultural resources, and search grounding, which retrieves from real-time web search. We distinguish *cultural propositional knowledge* (e.g., recognition of institutions or norms) from *open-ended cultural fluency* (e.g., contextual appropriateness, non-stereotypical expression), and evaluate both strategies on a series of cultural familiarity benchmarks comprising multiple-choice tasks and a human evaluation. Results show that search grounding significantly improves accuracy on propositional knowledge tasks, but it also increases the risk of stereotypical judgments and fails to improve human-rated cultural familiarity; KB grounding is constrained by inadequate knowledge base coverage and a suboptimal retriever. Our work exposes a gap between multiple-choice benchmark performance and open-ended cultural fluency, offering both a conceptual framework and methodological cautions for research on the cultural adaptability of LLMs.
📝 Abstract
Generative large language models (LLMs) have been shown to have gaps in their knowledge of diverse cultures across the globe. We investigate the effect of retrieval-augmented generation and search-grounding techniques on the ability of LLMs to display familiarity with a diverse range of national cultures. Specifically, we compare the performance of standard LLMs, LLMs augmented with retrievals from a bespoke knowledge base (i.e., KB grounding), and LLMs augmented with retrievals from a web search (i.e., search grounding) on a series of cultural familiarity benchmarks. We find that search grounding significantly improves LLM performance on multiple-choice benchmarks that test propositional knowledge (e.g., the norms, artifacts, and institutions of national cultures), while the effectiveness of KB grounding is limited by inadequate knowledge base coverage and a suboptimal retriever. However, search grounding also increases the risk of stereotypical judgments by language models, and it fails to improve evaluators' judgments of cultural familiarity in a human evaluation with adequate statistical power. These results highlight the distinction between propositional knowledge about a culture and open-ended cultural fluency when evaluating the cultural familiarity of generative LLMs.