Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses a significant self-preference bias in large language models (LLMs) during criterion-based evaluation, which distorts assessments even under fully objective, programmatically verifiable standards and thereby impedes model optimization. Focusing on binary judgment paradigms, the work systematically investigates this bias using the IFEval and HealthBench benchmarks, employing multi-model ensemble adjudication, rule-based verification, and quantitative analysis. It reveals, for the first time, the persistence of such bias in objective settings and identifies the characteristics of scoring criteria that are particularly susceptible. Experiments demonstrate that LLMs are up to 50% more likely to misjudge their own failing outputs as compliant, leading to score discrepancies of up to 10 points in medical dialogue tasks. While ensemble adjudication partially mitigates the issue, it fails to eliminate the bias entirely.
📝 Abstract
LLM-as-a-judge has become the de facto approach for evaluating LLM outputs. However, judges are known to exhibit self-preference bias (SPB): they tend to favor outputs produced by themselves or by models from their own family. This skews evaluations and, thus, hinders model development, especially in settings of recursive self-improvement. We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings. Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50% more likely to incorrectly mark them as satisfied when the output is their own. We also find that, similarly to other evaluation paradigms, ensembling multiple judges helps mitigate SPB, but without fully eliminating it. On HealthBench, a medical chat benchmark with subjective rubrics, we observe that SPB skews model scores by up to 10 points, a potentially decisive margin when ranking frontier models. We analyze the factors that drive SPB in this setting, finding that negative rubrics, extreme rubric lengths, and subjective topics like emergency referrals are particularly susceptible.
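The SPB measurement the abstract describes can be sketched in a few lines. This is not the paper's code; the function name and toy data are illustrative, and the metric shown (relative increase in erroneous "satisfied" verdicts on a judge's own outputs, restricted to rubrics the generator actually fails) is one plausible reading of the "up to 50% more likely" figure.

```python
# Hedged sketch: quantifying self-preference bias (SPB) on failing rubrics.
# All names and data below are illustrative, not taken from the paper.

def false_satisfied_rate(verdicts):
    """Relative increase in wrong 'satisfied' verdicts on a judge's own outputs.

    verdicts: list of (judge_says_satisfied, output_is_judges_own) pairs,
    restricted to rubrics where the ground truth is 'not satisfied'.
    """
    own = [v for v, is_own in verdicts if is_own]
    other = [v for v, is_own in verdicts if not is_own]
    rate_own = sum(own) / len(own)       # error rate on the judge's own outputs
    rate_other = sum(other) / len(other) # error rate on other models' outputs
    return (rate_own - rate_other) / rate_other

# Toy example: the judge wrongly passes 3/4 of its own failing outputs,
# but only 2/4 of another model's failing outputs.
verdicts = [
    (True, True), (True, True), (True, True), (False, True),
    (True, False), (True, False), (False, False), (False, False),
]
print(false_satisfied_rate(verdicts))  # 0.5 -> 50% more likely on own outputs
```

A per-rubric breakdown of the same quantity would support the paper's analysis of which rubric types (negative, very short or very long, subjective) are most susceptible.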
Problem

Research questions and friction points this paper is trying to address.

self-preference bias
rubric-based evaluation
LLM-as-a-judge
evaluation bias
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Preference Bias
Rubric-Based Evaluation
LLM-as-a-Judge
Objective Rubrics
Ensemble Mitigation
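The ensemble mitigation listed above amounts to aggregating per-rubric verdicts across several judges so that no single self-biased judge decides alone. A minimal sketch, assuming simple majority voting (the paper may aggregate differently):

```python
# Hedged sketch of ensemble adjudication over binary rubric verdicts.
# Majority voting is an assumption here, not necessarily the paper's scheme.

def majority_verdict(verdicts):
    """True iff a strict majority of judges marks the rubric as satisfied."""
    return sum(verdicts) > len(verdicts) / 2

# With three judges, one biased self-judge cannot flip the verdict alone.
print(majority_verdict([True, False, False]))  # False
print(majority_verdict([True, True, False]))   # True
```

As the abstract notes, this reduces but does not eliminate SPB: when several judges share a model family, their biases can still align.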
José Pombal
Sword Health
Ricardo Rei
Sword Health
Healthcare AI · Machine Learning · Natural Language Processing · Large Language Models
André F. T. Martins
Instituto de Telecomunicações; Instituto Superior Técnico, Universidade de Lisboa; TransPerfect; ELLIS Unit Lisbon