Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates bias in large language models (LLMs) when deployed as automated content evaluators in communication systems, exposing how implicit and explicit biases in their scoring processes undermine fairness and user trust. Methodologically, we systematically identify 11 categories of judgment bias and conduct empirical analysis using GPT-Judge and JudgeLM across four dimensions: pointwise scoring, fine-grained criterion-level analysis, bias-aware response fine-tuning, and cross-dataset evaluation. Key findings include: (i) mainstream LLM judges exhibit robustness to biased inputs; (ii) refining scoring criteria significantly improves fairness; (iii) training data bias severely degrades evaluation reliability; and (iv) task difficulty strongly correlates with scoring outcomes. Based on these insights, we propose four transferable bias-mitigation strategies—criterion decomposition, bias-aware calibration, data balancing, and difficulty-aware normalization—to advance trustworthy, equitable AI judging paradigms.

📝 Abstract
Large Language Models (LLMs) are increasingly being used to autonomously evaluate the quality of content in communication systems, e.g., to assess responses in telecom customer support chatbots. However, the impartiality of these AI "judges" is not guaranteed, and any biases in their evaluation criteria could skew outcomes and undermine user trust. In this paper, we systematically investigate judgment biases in two LLM-as-a-judge models (i.e., GPT-Judge and JudgeLM) under the point-wise scoring setting, encompassing 11 types of biases that cover both implicit and explicit forms. We observed that state-of-the-art LLM judges demonstrate robustness to biased inputs, generally assigning them lower scores than the corresponding clean samples. Providing a detailed scoring rubric further enhances this robustness. We further found that fine-tuning an LLM on high-scoring yet biased responses can significantly degrade its performance, highlighting the risk of training on biased data. We also discovered that the judged scores correlate with task difficulty: a challenging dataset like GPQA yields lower average scores, whereas an open-ended reasoning dataset (e.g., JudgeLM-val) sees higher average scores. Finally, we proposed four potential mitigation strategies to ensure fair and reliable AI judging in practical communication scenarios.
Problem

Research questions and friction points this paper is trying to address.

Evaluating bias in LLM judges used for content quality assessment
Investigating 11 types of implicit and explicit judgment biases
Developing mitigation strategies for fair AI judging in communications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically investigated 11 types of LLM judgment biases
Proposed four mitigation strategies for fair AI judging
Found scoring rubrics enhance robustness to biased inputs
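One of the four proposed strategies, difficulty-aware normalization, addresses the finding that judged scores track task difficulty (lower averages on GPQA, higher on JudgeLM-val). A minimal sketch, assuming per-dataset z-score standardization (the paper's exact formulation is not given here):

```python
from statistics import mean, pstdev

def normalize_scores(scores_by_dataset: dict[str, list[float]]) -> dict[str, list[float]]:
    """Standardize judged scores within each dataset so that results from
    hard benchmarks and easier open-ended sets become comparable."""
    normalized = {}
    for name, scores in scores_by_dataset.items():
        mu, sigma = mean(scores), pstdev(scores)
        # Guard against zero variance (all scores identical in a dataset).
        normalized[name] = [
            (s - mu) / sigma if sigma > 0 else 0.0 for s in scores
        ]
    return normalized
```

After normalization, a score reflects how a response ranks within its own dataset rather than the dataset's absolute difficulty.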
Jiaxin Gao
School of Cyber Science and Engineering, Wuhan University, China
Chen Chen
School of Computer Science and Engineering, Nanyang Technological University, Singapore
Yanwen Jia
School of Cyber Science and Engineering, Wuhan University, China
Xueluan Gong
Nanyang Technological University
Computer science
Kwok-Yan Lam
Nanyang Technological University
Cybersecurity, Privacy-Preserving Technologies, Digital Trust, Distributed Systems, LegalTech
Qian Wang
School of Cyber Science and Engineering, Wuhan University, China