BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of LLM-as-a-Judge evaluation methods to unknown biases and the absence of systematic, automated mechanisms for bias detection, which undermines their reliability and robustness. To this end, the authors propose BiasScope, a novel framework that enables proactive, automated, and scalable probing of evaluation biases in large language models without relying on human-curated bias lists. BiasScope integrates LLM-driven systematic prompt generation with cross-model-family bias analysis and introduces JudgeBench-Pro, a more challenging benchmark for evaluator robustness. Experiments demonstrate that BiasScope is both generalizable and effective on the original JudgeBench, while revealing alarming fragility in current practices: on JudgeBench-Pro, mainstream LLMs acting as judges exhibit error rates exceeding 50%, highlighting significant weaknesses in existing evaluation paradigms.
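To make the high-level pipeline concrete, here is a minimal, hypothetical sketch of a bias-probing loop in the spirit described above. The helper callables `generate_perturbations`, `apply_perturbation`, and `ask_judge`, the judge model IDs, and the flip-count threshold are illustrative assumptions, not the paper's actual implementation.

```python
from collections import defaultdict


def probe_biases(question, better_answer, worse_answer,
                 generate_perturbations, apply_perturbation, ask_judge,
                 judge_models=("judge-family-a", "judge-family-b", "judge-family-c"),
                 n_candidates=10, min_flips=2):
    """Probe for candidate evaluation biases without a human-curated bias list.

    A generator LLM (wrapped by `generate_perturbations`) proposes candidate
    perturbations; each is applied to the objectively worse answer, and judges
    from different model families re-evaluate the pair. A perturbation that
    flips verdicts in several families is reported as a candidate bias.
    """
    candidates = generate_perturbations(question, n_candidates)
    flip_counts = defaultdict(int)

    for perturbation in candidates:
        # Perturb only the worse answer, so a verdict flip cannot be explained
        # by a genuine improvement in answer quality.
        perturbed_worse = apply_perturbation(worse_answer, perturbation)

        for judge in judge_models:
            baseline = ask_judge(judge, question, better_answer, worse_answer)
            perturbed = ask_judge(judge, question, better_answer, perturbed_worse)
            if baseline == "A" and perturbed == "B":
                flip_counts[perturbation] += 1

    # Cross-model-family agreement: keep perturbations that flip verdicts
    # in at least `min_flips` judge families.
    return {p: n for p, n in flip_counts.items() if n >= min_flips}
```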

📝 Abstract
LLM-as-a-Judge has been widely adopted across research and practical applications, yet the robustness and reliability of its evaluation remain a critical issue. A core challenge is bias, which has primarily been studied in terms of known biases and their impact on evaluation outcomes, while automated and systematic exploration of potential unknown biases is still lacking. Yet such exploration is crucial for enhancing the robustness and reliability of evaluations. To bridge this gap, we propose BiasScope, an LLM-driven framework for automatically discovering, at scale, potential biases that may arise during model evaluation. BiasScope can uncover potential biases across different model families and scales, and its generality and effectiveness are validated on the JudgeBench dataset. It overcomes the limitations of existing approaches, transforming bias discovery from a passive process relying on manual effort and predefined bias lists into an active, comprehensive, automated exploration. Moreover, building on BiasScope, we propose JudgeBench-Pro, an extended version of JudgeBench and a more challenging benchmark for evaluating the robustness of LLM-as-a-Judge. Strikingly, even powerful LLMs used as evaluators show error rates above 50% on JudgeBench-Pro, underscoring the urgent need to strengthen evaluation robustness and further mitigate potential biases.
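The reported error rates above 50% refer to how often a judge's verdict disagrees with the ground-truth preference on the benchmark. A minimal sketch of that metric, assuming a benchmark of labeled response pairs and an `ask_judge` wrapper around an LLM call (both hypothetical names), might look like:

```python
def judge_error_rate(benchmark, ask_judge):
    """Fraction of pairs on which the judge prefers the wrong answer.

    `benchmark` is an iterable of dicts with keys 'question', 'answer_a',
    'answer_b', and 'label' (the ground-truth better answer, "A" or "B");
    `ask_judge` wraps an LLM judge call and returns "A" or "B".
    """
    errors, total = 0, 0
    for item in benchmark:
        verdict = ask_judge(item["question"], item["answer_a"], item["answer_b"])
        total += 1
        errors += verdict != item["label"]
    return errors / total if total else 0.0
```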
Problem

Research questions and friction points this paper is trying to address.

bias
LLM-as-a-Judge
evaluation robustness
automated bias detection
unknown biases
Innovation

Methods, ideas, or system contributions that make the work stand out.

BiasScope
LLM-as-a-Judge
bias detection
automated exploration
JudgeBench-Pro