🤖 AI Summary
Current LLM cultural alignment evaluations predominantly adopt a "trivia-centered paradigm," reducing culture to static facts or values and assessing it via closed-ended questions, thereby overlooking culture's inherent plurality and dynamism, as well as the implicit cultural assumptions embedded throughout evaluation design. Method: We propose an "intentionally cultural evaluation" framework that systematically identifies cultural assumptions across the entire evaluation pipeline (task formulation, dataset construction, metric definition, and result interpretation); integrates researcher positionality reflection and community-engaged design; and employs critical analysis alongside HCI-inspired participatory methodologies to deconstruct cultural biases in mainstream benchmarks. Contribution/Results: Our work reveals systematic cultural biases in prominent evaluation suites, articulates four actionable principles for culturally sensitive assessment, and advances NLP evaluation toward greater inclusivity and cultural reflexivity.
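To make the contrast concrete, here is a minimal, hypothetical Python sketch of the two paradigms. Every name, field, and example item below is illustrative and assumed for this summary; none of it is taken from the paper or any real benchmark.

```python
from dataclasses import dataclass

# Trivia-centered paradigm: culture reduced to a closed-ended fact lookup.
# (Hypothetical item, not drawn from any real benchmark.)
trivia_item = {
    "question": "What is the traditional New Year dish in country X?",
    "choices": ["A", "B", "C", "D"],
    "answer": "B",  # presumes a single "correct" cultural fact
}

# Intentionally cultural evaluation: audit cultural assumptions at every
# stage of the pipeline, not only in explicitly cultural tasks.
PIPELINE_STAGES = (
    "task_formulation",
    "dataset_construction",
    "metric_definition",
    "result_interpretation",
)

@dataclass
class CulturalAssumptionAudit:
    """One audit record per pipeline stage (illustrative structure)."""
    stage: str                  # one of PIPELINE_STAGES
    assumption: str             # implicit cultural assumption identified
    affected_groups: list[str]  # whose perspectives it privileges or omits
    mitigation: str             # e.g., community review, plural gold labels

audit_log = [
    CulturalAssumptionAudit(
        stage="metric_definition",
        assumption="Each cultural question has exactly one gold answer",
        affected_groups=["diaspora communities", "regional subcultures"],
        mitigation="Score against a distribution of community-sourced answers",
    ),
]
```

The point of the sketch is that the audit record attaches cultural reflexivity to pipeline stages that trivia-style benchmarks treat as culturally neutral.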
📝 Abstract
The prevailing "trivia-centered paradigm" for evaluating the cultural alignment of large language models (LLMs) is increasingly inadequate as these models become more advanced and widely deployed. Existing approaches typically reduce culture to static facts or values, testing models via multiple-choice or short-answer questions that treat culture as isolated trivia. Such methods neglect the pluralistic and interactive realities of culture, and overlook how cultural assumptions permeate even ostensibly "neutral" evaluation settings. In this position paper, we argue for "intentionally cultural evaluation": an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation, not just in explicitly cultural tasks. We systematically characterize what culturally contingent considerations arise in evaluation, how they arise, and the circumstances under which they do so, and emphasize the importance of researcher positionality for fostering inclusive, culturally aligned NLP research. Finally, we discuss implications and future directions for moving beyond current benchmarking practices, uncovering important applications that we do not yet know exist, and involving communities in evaluation design through HCI-inspired participatory methodologies.