Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations

πŸ“… 2026-01-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the observed self-preference of large language models (LLMs) during evaluation, which may stem from confounding factors such as "narcissism" or task difficulty. To disentangle these influences, the authors propose an evaluator-quality baseline that compares a model’s preference for its own erroneous outputs against those generated by others. This approach reduces measurement error by 89.6% and, when validated across 37,448 queries, reveals that only 51% of the originally reported self-preference effects remain statistically significant under the new baseline. By integrating an automated evaluation framework, significance testing, and entropy analysis, this study substantially corrects prior misattributions of LLM "narcissism" and establishes a more reliable paradigm for evaluating self-preference in language models.

πŸ“ Abstract
Recent research has shown that large language models (LLMs) favor their own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows. However, it is difficult to disentangle which evaluation biases are explained by narcissism versus general experimental confounds, distorting measurements of self-preference bias. We discover a core methodological confound which could reduce measurement error by 89.6%. Specifically, LLM evaluators may deliver self-preferring verdicts when the judge responds to queries that it completed incorrectly itself; this would be true regardless of whether one of the responses under comparison is its own. To decouple self-preference signals from noisy outputs on hard problems, we introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model. Evaluating this simple baseline on 37,448 queries, only 51% of initial findings retain statistical significance. Finally, we turn towards characterizing the entropy of "easy" versus "hard" evaluation votes from LLM judges. Our corrective baseline enables future research on self-preference by eliminating noisy data from potential solutions. More widely, this work contributes to the growing body of work on cataloging and isolating judge-bias effects.
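The core of the Evaluator Quality Baseline, as the abstract describes it, is a comparison of two error rates: how often a judge votes for an incorrect response when that response is its own, versus when it comes from another model. A minimal sketch of that comparison, with a standard two-proportion z-test for significance, might look like the following. All function and field names here are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of the Evaluator Quality Baseline idea: compare
# the rate at which a judge picks an INCORRECT response when it is the
# judge's own output vs. when it came from another model, then test
# whether the gap is statistically significant.
from math import sqrt, erf


def self_preference_gap(votes):
    """votes: list of dicts with illustrative keys:
       'respondent_is_self' (bool)  - the incorrect response was the judge's own
       'voted_for_incorrect' (bool) - the judge picked the incorrect response
    Returns (rate_self, rate_other, z, p_two_sided)."""
    self_votes = [v["voted_for_incorrect"] for v in votes if v["respondent_is_self"]]
    other_votes = [v["voted_for_incorrect"] for v in votes if not v["respondent_is_self"]]
    n1, n2 = len(self_votes), len(other_votes)
    p1 = sum(self_votes) / n1   # P(votes for incorrect | response is its own)
    p2 = sum(other_votes) / n2  # P(votes for incorrect | response is another model's)
    # Pooled two-proportion z-test: is the judge measurably more
    # lenient toward its own incorrect responses?
    pooled = (sum(self_votes) + sum(other_votes)) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p1, p2, z, p_value
```

Under this framing, a positive and significant gap indicates self-preference beyond what judge quality alone explains; a gap that vanishes once incorrect-response votes are baselined would be the kind of finding the paper reports losing significance.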
Problem

Research questions and friction points this paper is trying to address.

self-preference, LLM evaluators, evaluation bias, narcissism, judge bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-preference bias, LLM evaluation, evaluator quality baseline, judge bias, measurement confound
Dani Roytburg
Carnegie Mellon University
Matthew Bozoukov
University of California, San Diego
Matthew Nguyen
University of Virginia
Mackenzie Puig-Hall
Apart Research
Narmeen Oozeer
Research Engineer, Martian Learning
mathematics, deep learning, interpretability