Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation

📅 2024-10-28
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the robustness of large language models (LLMs) acting as evaluators ("LLM-judges") to epistemic markers—such as "possibly" or "uncertain"—in generated text. We identify a systematic negative bias: LLM-judges consistently underestimate the quality of outputs containing uncertainty expressions. To address this, we introduce EMBER, the first benchmark explicitly designed to probe epistemic marker sensitivity. EMBER combines prompt engineering with controllable perturbations to generate epistemically marked variants and employs both single-turn scoring and pairwise comparison evaluation paradigms across mainstream models (e.g., GPT-4o). Experiments reveal a statistically significant negative bias across all tested LLM-judges (mean score deviation: 12.3%, *p* < 0.001), indicating that evaluations are driven more by linguistic surface cues than by semantic correctness. This study provides the first empirical evidence of epistemic marker sensitivity in LLM-based evaluation, offering critical insights and methodological foundations for developing trustworthy AI assessment frameworks.

📝 Abstract
In line with the principle of honesty, there has been a growing effort to train large language models (LLMs) to generate outputs containing epistemic markers. However, evaluation in the presence of epistemic markers has been largely overlooked, raising a critical question: Could the use of epistemic markers in LLM-generated outputs lead to unintended negative consequences? To address this, we present EMBER, a benchmark designed to assess the robustness of LLM-judges to epistemic markers in both single and pairwise evaluation settings. Our findings, based on evaluations using EMBER, reveal that all tested LLM-judges, including GPT-4o, show a notable lack of robustness in the presence of epistemic markers. Specifically, we observe a negative bias toward epistemic markers, with a stronger bias against markers expressing uncertainty. This suggests that LLM-judges are influenced by the presence of these markers and do not focus solely on the correctness of the content.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM-judges' robustness to epistemic markers
Investigating bias in LLM evaluations with uncertainty expressions
Evaluating impact of epistemic markers on LLM-generated outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

EMBER benchmark tests LLM-judges' robustness to epistemic markers
LLM-judges show negative bias against uncertainty markers
Evaluates marker effects in both single and pairwise settings
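The single-setting evaluation the paper describes can be sketched in a few lines: score the same answer with and without an epistemic marker and measure the score gap. The sketch below is a hypothetical illustration, not the authors' code; `judge_score` is a stub standing in for a real LLM-judge call, and the marker list and 0.2 penalty are assumptions chosen only to mimic the reported negative bias.

```python
# Hypothetical single-setting robustness probe in the spirit of EMBER.
# A real judge_score would prompt an LLM (e.g., GPT-4o) to rate the answer.

EPISTEMIC_MARKERS = ["possibly", "I think", "I'm not sure, but"]

def add_marker(answer: str, marker: str) -> str:
    """Prefix an answer with an epistemic marker, preserving its content."""
    return f"{marker} {answer[0].lower()}{answer[1:]}"

def judge_score(question: str, answer: str) -> float:
    """Stub judge: rates a correct answer 1.0, but applies a hypothetical
    penalty to hedged phrasing, mimicking the bias the paper reports."""
    score = 1.0  # assume the unmarked answer is factually correct
    if any(m.lower() in answer.lower() for m in EPISTEMIC_MARKERS):
        score -= 0.2  # illustrative penalty, not a measured value
    return score

def marker_bias(question: str, answer: str) -> float:
    """Mean score drop when markers are added; positive = negative bias."""
    base = judge_score(question, answer)
    marked = [judge_score(question, add_marker(answer, m))
              for m in EPISTEMIC_MARKERS]
    return base - sum(marked) / len(marked)

bias = marker_bias("What is the capital of France?",
                   "The capital of France is Paris.")
print(f"mean score drop with markers: {bias:.2f}")
```

In the pairwise setting, the analogous probe would present the judge with the marked and unmarked variants side by side and record how often the unmarked one wins despite identical content.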