π€ AI Summary
In LLM-as-a-Judge settings, agent evaluation models often inherit preference biases from teacher-model-generated annotations. To address this, we propose Assistant-Guided Debiasing for Judgment (AGDe-Judge), a novel three-stage debiasing paradigm: (1) unbiased assistant collaboration for training data construction, (2) supervision signal disentanglement to separate label-level and feedback-level biases, and (3) feedback-aware adversarial fine-tuning. AGDe-Judge is the first method to jointly correct biases at both the label and feedback layers, eliminating implicit preference modeling of teacher outputs. Evaluated on six mainstream benchmarks, it reduces teacher-induced preference bias by an average of 32.7% while maintaining strong alignment with human judgments (Kendallβs Ο β₯ 0.81). This work establishes a scalable, interpretable, and principled pathway toward trustworthy automated evaluation.
π Abstract
LLM-as-a-Judge employs large language models (LLMs), such as GPT-4, to evaluate the quality of LLM-generated responses, gaining popularity for its cost-effectiveness and strong alignment with human evaluations. However, training proxy judge models using evaluation data generated by powerful teacher models introduces a critical yet previously overlooked issue: teacher preference bias, where the proxy judge model learns a biased preference for responses from the teacher model. To tackle this problem, we propose a novel setting that incorporates an additional assistant model, which is not biased toward the teacher model's responses, to complement the training data. Building on this setup, we introduce AGDe-Judge, a three-stage framework designed to debias from both the labels and feedbacks in the training data. Extensive experiments demonstrate that AGDe-Judge effectively reduces teacher preference bias while maintaining strong performance across six evaluation benchmarks. Code is available at https://github.com/Liuz233/AGDe-Judge.