🤖 AI Summary
Large language models (LLMs) exhibit inherent biases when employed as evaluators, undermining assessment reliability; existing in-context learning and fine-tuning approaches struggle to balance generality and applicability, especially for closed-source models. To address this, we propose the Reasoning-based Bias Detector (RBD), an external, plug-and-play, reasoning-driven module that detects biased evaluations and guides evaluator self-correction across four canonical bias types without modifying the original evaluator. RBD formalizes bias mechanisms via structured reasoning, is trained with distilled reasoning-based fine-tuning at four scales (1.5B–14B), and drives an iterative detect-feedback-revise pipeline. Evaluated with eight LLM evaluators across four bias benchmarks, RBD improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, surpassing prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively, while demonstrating strong generalization, computational efficiency, and model-agnosticism.
📝 Abstract
LLM-as-a-Judge has emerged as a promising tool for automatically evaluating generated outputs, but its reliability is often undermined by potential biases in judgment. Existing efforts to mitigate these biases face key limitations: in-context learning-based methods fail to address deeply rooted biases due to the evaluator's limited capacity for self-reflection, whereas fine-tuning is not applicable to all evaluator types, especially closed-source models. To address this challenge, we introduce the Reasoning-based Bias Detector (RBD), a plug-in module that identifies biased evaluations and generates structured reasoning to guide evaluator self-correction. Rather than modifying the evaluator itself, RBD operates externally and engages in an iterative process of bias detection and feedback-driven revision. To support its development, we design a complete pipeline consisting of biased dataset construction, supervision collection, distilled reasoning-based fine-tuning of RBD, and integration with LLM evaluators. We fine-tune four sizes of RBD models, ranging from 1.5B to 14B, and observe consistent performance improvements across all scales. Experimental results on four bias types (verbosity, position, bandwagon, and sentiment), evaluated using eight LLM evaluators, demonstrate RBD's strong effectiveness. For example, the RBD-8B model improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, and surpasses prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively. These results highlight RBD's effectiveness and scalability. Additional experiments further demonstrate its strong generalization across biases and domains, as well as its efficiency.
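The iterative detect-feedback-revise process described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `evaluator` and `bias_detector` are hypothetical stand-ins for the underlying LLM calls, and the loop bound `max_rounds` is an assumed hyperparameter.

```python
def evaluate_with_rbd(evaluator, bias_detector, prompt, responses, max_rounds=3):
    """Iterative detect-feedback-revise loop (illustrative sketch).

    `evaluator(prompt, responses, feedback)` returns a judgment;
    `bias_detector(prompt, responses, verdict)` returns a tuple
    (is_biased, reasoning). Both are hypothetical interfaces.
    """
    # Initial judgment with no corrective feedback.
    verdict = evaluator(prompt, responses, feedback=None)
    for _ in range(max_rounds):
        # External module inspects the judgment for bias.
        biased, reasoning = bias_detector(prompt, responses, verdict)
        if not biased:
            break  # detector accepts the current judgment
        # Feed the structured reasoning back so the evaluator
        # can self-correct; the evaluator itself is never modified.
        verdict = evaluator(prompt, responses, feedback=reasoning)
    return verdict
```

The key design point visible here is that RBD sits entirely outside the evaluator: it only observes judgments and returns reasoning as feedback, which is why the approach also applies to closed-source models.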