Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation

πŸ“… 2026-03-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the risk that large language models (LLMs) acting as evaluators may carry unknown, or adversarially exploitable, biases, compromising the reliability of feedback loops in autonomous AI systems. To mitigate this, the authors propose the Average Bias-Boundedness (A-BB) algorithmic framework, which they present as the first approach to formally control LLM evaluator bias with verifiable, bounded guarantees, even when the bias direction is unknown or adversarially manipulated. A-BB combines statistical hypothesis testing with rank correlation analysis. Evaluated on the Arena-Hard-Auto benchmark under τ=0.5 and δ=0.01, it retains 61%–99% of the original ranking correlation under format- and structure-induced biases, with most configurations exceeding 80%, balancing safety guarantees against practical utility.
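The summary mentions statistical hypothesis testing at significance δ=0.01 for detecting format-induced bias, but does not spell out the test. A minimal sketch of one plausible instantiation, a one-sided exact sign test on how often a judge's verdict flips when only answer formatting is altered (the counts and the choice of test here are illustrative assumptions, not taken from the paper):

```python
from math import comb

def sign_test_p(n_flips: int, n_trials: int) -> float:
    """One-sided exact binomial p-value: P(X >= n_flips) under a fair coin,
    i.e., under the null hypothesis that formatting does not sway the judge."""
    return sum(comb(n_trials, k) for k in range(n_flips, n_trials + 1)) / 2 ** n_trials

# Hypothetical counts: out of 40 paired comparisons, the judge's verdict
# moved toward the restyled answer 32 times when only formatting changed.
DELTA = 0.01  # significance level from the paper's (tau=0.5, delta=0.01) setting
p = sign_test_p(32, 40)
print(f"p = {p:.2e}, bias flagged: {p < DELTA}")
```

With a flip rate this lopsided, the p-value falls well below δ and the bias would be flagged; a balanced flip count would not.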

πŸ“ Abstract
As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.
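The abstract's retention figures (61–99%) measure rank correlation between a judge's original ranking and its ranking under injected bias. A minimal sketch of that check using Kendall's τ, with a toy five-model ranking and the paper's τ=0.5 threshold (the model names and the single adjacent swap are illustrative; the A-BB procedure and Arena-Hard-Auto harness are not reproduced here):

```python
from itertools import combinations

def kendall_tau(rank_a: list, rank_b: list) -> float:
    """Kendall rank correlation between two tie-free rankings of the same items."""
    assert set(rank_a) == set(rank_b)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for (i, x), (j, y) in combinations(enumerate(rank_a), 2):
        # A pair is concordant if both rankings order it the same way.
        if (pos_b[x] - pos_b[y]) * (i - j) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

original = ["m1", "m2", "m3", "m4", "m5"]          # judge's unbiased ranking
biased   = ["m1", "m3", "m2", "m4", "m5"]          # one adjacent swap under injected bias

TAU_THRESHOLD = 0.5  # from the paper's (tau=0.5, delta=0.01) guarantee
tau = kendall_tau(original, biased)
print(f"tau = {tau:.2f}, bound satisfied: {tau >= TAU_THRESHOLD}")  # tau = 0.80
```

Here a single adjacent swap among five models yields τ = 0.8, i.e., 80% of pairwise orderings preserved, which matches the regime the abstract reports for most judge-bias combinations.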
Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-Judge
bias
unbiased evaluation
autonomous AI systems
bias guarantees
Innovation

Methods, ideas, or system contributions that make the work stand out.

bias-bounded evaluation
LLM-as-a-Judge
average bias-boundedness
provable fairness
autonomous AI feedback
πŸ”Ž Similar Papers
No similar papers found.