🤖 AI Summary
This study addresses significant language bias in large language models (LLMs) when they are employed as judges (LLM-as-a-judge) in multilingual evaluation settings, a bias that undermines assessment fairness. It presents the first systematic distinction and quantification of two types of bias: intra-language bias, where a judge comparing two answers written in the same language performs unevenly across language families (e.g., Romance versus Bantu languages), and cross-language bias, where a judge comparing answers written in different languages favors major languages (e.g., English versus African languages). The findings reveal a consistent preference for English-language responses, with the language of the answer exerting a stronger influence on the bias than the language of the question. Through multilingual pairwise comparisons, perplexity analysis, and cross-lingual control experiments, the study shows that the bias is especially pronounced in culturally sensitive subjects and cannot be fully explained by lower perplexity alone. As a result, mainstream LLMs exhibit substantially degraded judging performance on non-European languages.
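The multilingual pairwise comparisons mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' actual harness: the judge prompt wording and the `call_judge` helper are hypothetical placeholders for whatever LLM API is used, and the position swap is included only because pairwise judges are known to be order-sensitive.

```python
# Minimal sketch of a cross-language pairwise LLM-as-a-judge comparison.
# The prompt wording and call_judge() are hypothetical; plug in your own
# client library and judge model.

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
reply with "A" if Answer A is better, or "B" if Answer B is better.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""


def call_judge(prompt: str) -> str:
    """Placeholder for an LLM API call; must return 'A' or 'B'."""
    raise NotImplementedError("wire this up to your judge model")


def judge_pair(question: str, answer_en: str, answer_other: str) -> dict:
    """Judge an English answer against a non-English answer in both orders,
    so that genuine language preference is separated from position bias."""
    first = call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_en, answer_b=answer_other))
    second = call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_other, answer_b=answer_en))
    # English "wins" a round when it is chosen regardless of its position.
    english_wins = int(first == "A") + int(second == "B")
    return {"english_wins": english_wins, "rounds": 2}
```

Aggregating `english_wins` over many question/answer pairs gives the kind of win-rate statistic from which a preference for English answers can be read off.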
📝 Abstract
Recent advances in Large Language Models (LLMs) have incentivized the development of LLM-as-a-judge, an application of LLMs in which a model is used as a judge to assess the quality of a piece of text in a given context. However, previous studies have demonstrated that LLM-as-a-judge can be biased towards various aspects of the judged texts, and these biases often do not align with human preferences. One of the identified biases is language bias: the decision of LLM-as-a-judge can differ based on the language of the judged texts. In this paper, we study two types of language bias in pairwise LLM-as-a-judge: (1) performance disparity between languages when the judge is prompted to compare options in the same language, and (2) bias towards options written in major languages when the judge is prompted to compare options in two different languages. We find that for same-language judging there are significant performance disparities across language families, with European languages consistently outperforming African languages, and that this bias is more pronounced in culturally related subjects. For inter-language judging, we observe that most models favor English answers, and that this preference is influenced more by the answer language than by the question language. Finally, we investigate whether language bias is in fact caused by low-perplexity bias, a previously identified bias of LLM-as-a-judge, and find that while perplexity is slightly correlated with language bias, language bias cannot be fully explained by perplexity alone.
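As a rough illustration of the perplexity analysis mentioned at the end of the abstract, the sketch below computes per-answer perplexity with an off-the-shelf causal language model and correlates it with how often the judge preferred each answer. The model choice (`gpt2`) and the toy `answers` / `win_rates` inputs are assumptions made for illustration, not the paper's actual setup or data.

```python
# Sketch: does the judge's preference track low perplexity?
# Model choice and the (answers, win_rates) inputs are illustrative assumptions.
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)
lm.eval()


def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under the causal LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()


# answers[i] is a judged answer; win_rates[i] is how often the judge picked it.
answers = [
    "The capital of Kenya is Nairobi.",
    "Mji mkuu wa Kenya ni Nairobi.",
    "La capitale du Kenya est Nairobi.",
    "Die Hauptstadt von Kenia ist Nairobi.",
]
win_rates = [0.80, 0.35, 0.60, 0.55]

ppls = [perplexity(a) for a in answers]
rho, p = spearmanr(ppls, win_rates)
print(f"Spearman correlation between perplexity and win rate: {rho:.2f} (p={p:.3f})")
```

A strongly negative correlation (lower perplexity, higher win rate) would support the low-perplexity explanation; the paper's finding that the correlation is only slight is what motivates the conclusion that perplexity alone does not account for language bias.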