Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge

📅 2024-06-12
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work identifies and systematically quantifies a pervasive position bias in large language models (LLMs) employed as automated evaluators: their preference rankings are significantly influenced by the order in which candidate answers appear in the prompt, undermining evaluation reliability. The study conducts pairwise and listwise comparisons across 22 tasks from MT-Bench and DevBench, evaluating 15 LLM judges and constructing a dataset of over 150,000 evaluation instances. It introduces three metrics (repetition stability, position consistency, and preference fairness) to localize bias sources at the judge, candidate, and task levels, and empirically demonstrates that position bias correlates strongly with the quality gap between answers rather than with random noise. Results confirm the ubiquity of this bias across models and tasks, with substantial inter-model and inter-task variation. The findings provide actionable, data-driven strategies for bias mitigation, advancing the robustness and fairness of LLM-based evaluation.

📝 Abstract
LLM-as-a-Judge has emerged as a promising alternative to human evaluators across various tasks, yet inherent biases, particularly position bias (the tendency to favor solutions based on their position within the prompt), compromise its reliability. This exploratory study evaluates position bias in LLM judges across pairwise and listwise comparison settings, introducing three metrics: repetition stability, position consistency, and preference fairness. Our experiments, involving 15 LLM judges across MT-Bench and DevBench with 22 tasks and approximately 40 solution-generating models, yield over 150,000 evaluation instances. We identify judge-level, candidate-level, and task-level factors contributing to bias. The findings confirm that position bias is not due to random chance and varies significantly across judges and tasks. While position bias is only weakly influenced by the length of prompt components, it is strongly affected by the quality gap between solutions. Our agreement and disagreement analysis among judges further provides insight into how judging difficulty is distributed across the dataset and highlights potential dataset modifications.
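The three metrics can be illustrated with a minimal sketch. The definitions below are an informal reading of the abstract, not the paper's exact formulas: repetition stability checks whether a judge returns the same verdict when an identical query is repeated, position consistency checks whether the preferred solution survives swapping the order of the two candidates, and preference fairness measures how inconsistent verdicts skew toward the first-shown or last-shown position. The tuple encoding and the normalizations are assumptions made for illustration.

```python
def repetition_stability(repeated_verdicts):
    """Fraction of queries whose verdict is identical across k repetitions.

    `repeated_verdicts` is a list of verdict lists, one inner list per query
    submitted to the judge k times with an unchanged prompt.
    """
    stable = sum(1 for verdicts in repeated_verdicts if len(set(verdicts)) == 1)
    return stable / len(repeated_verdicts)

def position_consistency(judgments):
    """Fraction of pairs where the same solution wins regardless of order.

    Each entry is a tuple (winner with A shown first, winner with B shown
    first), winners named by solution identity: "A" or "B". A consistent
    judge picks the same solution in both orderings.
    """
    return sum(ab == ba for ab, ba in judgments) / len(judgments)

def preference_fairness(judgments):
    """Signed skew of inconsistent verdicts, in [-1, 1].

    Positive values mean the judge drifts toward whichever solution is shown
    first (primacy), negative toward the last-shown (recency), and 0 means
    the inconsistencies are balanced between the two positions.
    """
    flips = [(ab, ba) for ab, ba in judgments if ab != ba]
    if not flips:
        return 0.0
    # ("A", "B") means the first-shown solution won in both orderings
    primacy = sum(1 for ab, ba in flips if (ab, ba) == ("A", "B"))
    recency = len(flips) - primacy
    return (primacy - recency) / len(flips)
```

Under this encoding, a judge that always prefers the first-shown answer scores 0 on position consistency and +1 on preference fairness, making the two failure modes easy to separate.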
Problem

Research questions and friction points this paper is trying to address.

Evaluates position bias in LLM-as-a-Judge systems
Identifies factors causing bias across judges and tasks
Analyzes impact of solution quality gaps on bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces metrics for position bias evaluation
Analyzes bias across multiple levels and tasks
Examines prompt length and solution quality impact