🤖 AI Summary
This study investigates the causal relationship between alignment training—such as instruction tuning and preference tuning—and numerical bias in large language models (LLMs) used as evaluators (LLM-as-a-judge), a phenomenon where models favor specific score values, compromising evaluation reliability. By comparing model outputs before and after alignment, the authors show that alignment significantly exacerbates numerical bias. They then conduct mitigation experiments with temperature scaling, distribution calibration, and score range adjustment, and find that score range adjustment is the most effective intervention: despite its heuristic nature, it not only substantially reduces bias but also improves overall evaluation performance. This work provides both empirical insights and practical solutions for understanding and mitigating bias in LLM-based evaluation.
📝 Abstract
"LLM-as-a-judge," which utilizes large language models (LLMs) as evaluators, has proven effective in many evaluation tasks. However, evaluator LLMs exhibit numerical bias, a phenomenon where certain evaluation scores are generated disproportionately often, leading to reduced evaluation performance. This study investigates the cause of this bias. Given that most evaluator LLMs are aligned through instruction tuning and preference tuning, and that prior research suggests alignment reduces output diversity, we hypothesize that numerical bias arises from alignment. To test this, we compare outputs from pre- and post-alignment LLMs, and observe that alignment indeed increases numerical bias. We also explore mitigation strategies for post-alignment LLMs, including temperature scaling, distribution calibration, and score range adjustment. Among these, score range adjustment is most effective in reducing bias and improving performance, though still heuristic. Our findings highlight the need for further work on optimal score range selection and more robust mitigation strategies.
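To make the mitigation strategies concrete, here is a minimal sketch of the first one, temperature scaling, as it would apply to a judge model's per-score distribution. This is an illustration, not the paper's implementation: the logit values and the `temperature_scaled_scores` helper are hypothetical, and it assumes access to the model's raw logits over the candidate score tokens.

```python
import math

def temperature_scaled_scores(score_logits, temperature=2.0):
    """Softmax over per-score logits with temperature scaling.

    score_logits: dict mapping candidate score -> raw logit.
    A temperature > 1 flattens the distribution, weakening the
    judge's tendency to over-produce a few favored score values.
    """
    scaled = {s: l / temperature for s, l in score_logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    exps = {s: math.exp(l - m) for s, l in scaled.items()}
    z = sum(exps.values())
    return {s: e / z for s, e in exps.items()}

# Hypothetical logits for a judge that strongly favors score 4.
logits = {1: 0.1, 2: 0.5, 3: 1.0, 4: 3.0, 5: 1.2}
probs_t1 = temperature_scaled_scores(logits, temperature=1.0)
probs_t2 = temperature_scaled_scores(logits, temperature=2.0)
# Raising the temperature shifts probability mass away from the
# favored score toward the rest of the range.
assert probs_t2[4] < probs_t1[4]
```

Distribution calibration and score range adjustment operate differently: the former reweights the score distribution against a reference, while the latter simply changes the numeric range the judge is prompted to use, which the abstract reports as the most effective despite being heuristic.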