🤖 AI Summary
Large language models (LLMs) exhibit inherent biases when employed as evaluators, undermining assessment reliability; existing in-context learning and fine-tuning approaches struggle to balance generality and applicability, especially for closed-source models. To address this, we propose the Reasoning-based Bias Detector (RBD), an external, plug-and-play, reasoning-driven module that detects biased evaluations and guides evaluator self-correction across four canonical bias types without modifying the original evaluator. RBD formalizes bias mechanisms via structured reasoning, is trained with distilled reasoning-based fine-tuning at four scales (1.5B–14B), and drives an iterative detect-feedback-revise pipeline. Evaluated with eight LLM evaluators across four bias benchmarks, RBD improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, surpassing prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively, while demonstrating strong generalization, computational efficiency, and model-agnosticism.
📝 Abstract
LLM-as-a-Judge has emerged as a promising tool for automatically evaluating generated outputs, but its reliability is often undermined by potential biases in judgment. Existing efforts to mitigate these biases face key limitations: in-context learning-based methods fail to address deeply rooted biases due to the evaluator's limited capacity for self-reflection, whereas fine-tuning is not applicable to all evaluator types, especially closed-source models. To address this challenge, we introduce the Reasoning-based Bias Detector (RBD), a plug-in module that identifies biased evaluations and generates structured reasoning to guide evaluator self-correction. Rather than modifying the evaluator itself, RBD operates externally and engages in an iterative process of bias detection and feedback-driven revision. To support its development, we design a complete pipeline consisting of biased dataset construction, supervision collection, distilled reasoning-based fine-tuning of RBD, and integration with LLM evaluators. We fine-tune four sizes of RBD models, ranging from 1.5B to 14B, and observe consistent performance improvements across all scales. Experimental results on four bias types (verbosity, position, bandwagon, and sentiment), evaluated using eight LLM evaluators, demonstrate RBD's strong effectiveness. For example, the RBD-8B model improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, and surpasses prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively. These results highlight RBD's effectiveness and scalability. Additional experiments further demonstrate its strong generalization across biases and domains, as well as its efficiency.
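The iterative detect-feedback-revise process described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `evaluator` and `bias_detector` are hypothetical stand-ins for the underlying LLM calls, and the loop bound `max_rounds` is an assumed hyperparameter.

```python
def evaluate_with_rbd(evaluator, bias_detector, prompt, responses, max_rounds=3):
    """Iterative detect-feedback-revise loop (illustrative sketch).

    `evaluator(prompt, responses, feedback)` returns a judgment;
    `bias_detector(prompt, responses, verdict)` returns a tuple
    (is_biased, reasoning). Both are hypothetical interfaces.
    """
    # Initial judgment with no corrective feedback.
    verdict = evaluator(prompt, responses, feedback=None)
    for _ in range(max_rounds):
        # External module inspects the judgment for bias.
        biased, reasoning = bias_detector(prompt, responses, verdict)
        if not biased:
            break  # detector accepts the current judgment
        # Feed the structured reasoning back so the evaluator
        # can self-correct; the evaluator itself is never modified.
        verdict = evaluator(prompt, responses, feedback=reasoning)
    return verdict
```

The key design point visible here is that RBD sits entirely outside the evaluator: it only observes judgments and returns reasoning as feedback, which is why the approach also applies to closed-source models.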