JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

Date: 2026-05-26
Citations: 0
Influential: 0
AI Summary
Existing cultural evaluation benchmarks oversimplify culture as static facts, failing to capture context-dependent, deep-seated cultural errors. This work proposes the first assessment framework specifically targeting "thick cultural errors" and introduces JuICE, a multilingual dataset comprising 1,050 query-response pairs from the United States, South Korea, Indonesia, and Bangladesh, annotated with 7,470 fine-grained cultural and linguistic errors across English and local languages. Through meticulous human annotation and systematic evaluation using the LLM-as-a-Judge paradigm, the study reveals that even state-of-the-art models achieve only an F1 score of 0.52 on error span detection, frequently missing nuanced errors readily identifiable by local residents. These findings underscore a fundamental limitation in large language models' capacity for genuine cultural understanding.
Abstract
As large language models (LLMs) are increasingly deployed to users around the world, they are integrated into everyday tasks across diverse cultural contexts, from drafting personal communications to brainstorming creative ideas. These tasks are inherently cultural: they require contextual appropriateness, symbolic resonance, and tacit cultural expectations that native speakers draw on instinctively, meaning that a response can be factually plausible yet unmistakably wrong to a local reader. Existing cultural benchmarks have treated culture as a flat set of facts via fact verification or norm entailment methods, and have adopted LLM-as-a-Judge without examining whether such judges can capture these thick cultural errors. To address this gap, we present JuICE (Benchmark for LLM-Judge in Identifying Cultural Errors), a multilingual dataset of 7,470 span-level annotations of cultural and linguistic errors in long-form LLM responses. It covers 1,050 query-response pairs from four countries (the United States, South Korea, Indonesia, and Bangladesh), in both English and each country's main language. Using JuICE, we find that even the strongest LLM-judge achieves only an F1 of 0.52 in the erroneous span detection task. Furthermore, LLM-judges consistently miss thick cultural errors that local residents readily identify. Our findings suggest that robust cultural evaluation must move beyond surface-level detection toward frameworks that account for the depth and situatedness of cultural meaning.
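To make the span detection metric concrete, here is a minimal sketch of one common way to score predicted error spans against human annotations. The abstract does not say how JuICE matches spans, so the character-level overlap used here is an assumption, and the names `span_chars` and `span_f1` are hypothetical rather than taken from the paper.

```python
def span_chars(spans):
    """Expand half-open (start, end) character spans into a set of offsets."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def span_f1(gold_spans, predicted_spans):
    """Character-overlap F1 between gold and predicted spans (assumed scheme)."""
    gold = span_chars(gold_spans)
    pred = span_chars(predicted_spans)
    if not gold and not pred:
        # Nothing to find and nothing flagged: treat as a perfect match.
        return 1.0
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)  # share of flagged characters that are real errors
    recall = overlap / len(gold)     # share of annotated error characters that were flagged
    return 2 * precision * recall / (precision + recall)


# A judge that flags characters 5-14 when annotators marked 0-9
# covers half of the gold span and is half correct.
score = span_f1([(0, 10)], [(5, 15)])
```

Under this scheme, a judge that flags only part of an annotated span, or flags extra text around it, is penalized in proportion to the mismatch. An exact-match or token-level scheme would give different numbers for the same predictions.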
Problem

Research questions and friction points this paper is trying to address.

cultural errors
LLM-judge
cultural evaluation
multilingual benchmark
contextual appropriateness
Innovation

Methods, ideas, or system contributions that make the work stand out.

cultural errors
LLM-as-a-Judge
multilingual benchmark
span-level annotation
thick cultural meaning