Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) employed as automated evaluators exhibit systematic biases in complex tasks, yet these biases remain poorly characterized and quantified. Method: We introduce ComplexEval—the first benchmark explicitly designed for high-complexity evaluation scenarios—comprising 12 foundational and 3 advanced tasks, incorporating multi-dimensional scoring criteria, unstructured reference answers, and fine-grained evaluation protocols. Contribution/Results: We systematically identify and quantify six previously unexplored evaluation biases, including the “curse of knowledge”—a paradoxical phenomenon where increased model capability exacerbates judgment bias. Empirical analysis across mainstream LLMs reveals statistically significant bias in all models, with bias magnitude monotonically increasing with task complexity. Our work provides critical empirical data and theoretical insights to advance the development of reliable, verifiable automated evaluation frameworks.

📝 Abstract
As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work focuses primarily on simple settings. Their reliability in complex tasks, where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical, remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated six previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal that (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity, and (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM reliability in complex multi-faceted tasks
Measuring bias susceptibility in LLM judges under advanced scenarios
Investigating paradoxical vulnerability in Large Reasoning Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed ComplexEval benchmark for bias quantification
Systematically investigated six unexplored auxiliary information biases
Analyzed bias susceptibility scaling with task complexity
Authors

- Weiyuan Li (Alibaba Group)
- Xintao Wang (Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University)
- Siyu Yuan (School of Data Science, Fudan University)
- Rui Xu (Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University)
- Jiangjie Chen (ByteDance Seed)
- Qingqing Dong (College of Cyber Science, Nankai University)
- Yanghua Xiao (Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University)
- Deqing Yang (School of Data Science, Fudan University)