Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation

📅 2025-04-20

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work identifies a critical vulnerability in the LLM-as-a-judge paradigm: pairwise preference evaluation is highly susceptible to manipulation through spurious (distractor) features. Embedding such features flips pairwise preferences in about 35% of cases, which inflates judgments of lower-quality outputs and yields misleading training signals. Absolute scoring is more robust, with flips in only about 9% of cases, and its judgments better reflect response quality. Through controlled perturbation injection, cross-protocol consistency analysis, and response quality attribution, the study compares the two protocols along three dimensions: bias, robustness, and reliability. It also offers recommendations for choosing a feedback protocol based on dataset characteristics and evaluation objectives. The findings give empirically grounded guidance for designing feedback protocols in RLAIF-based training and in model benchmarking with LLM judges.


📝 Abstract

Large Language Models (LLMs) are widely used as proxies for human labelers in both training (Reinforcement Learning from AI Feedback) and large-scale response evaluation (LLM-as-a-judge). Alignment and evaluation are critical components in the development of reliable LLMs, and the choice of feedback protocol plays a central role in both but remains understudied. In this work, we show that the choice of feedback protocol (absolute scores versus relative preferences) can significantly affect evaluation reliability and induce systematic biases. In particular, we show that pairwise evaluation protocols are more vulnerable to distracted evaluation. Generator models can exploit spurious attributes (or distractor features) favored by the LLM judge, resulting in inflated scores for lower-quality outputs and misleading training signals. We find that absolute scoring is more robust to such manipulation, producing judgments that better reflect response quality and are less influenced by distractor features. Our results demonstrate that generator models can flip preferences by embedding distractor features, skewing LLM-as-a-judge comparisons and leading to inaccurate conclusions about model quality in benchmark evaluations. Pairwise preferences flip in about 35% of the cases, compared to only 9% for absolute scores. We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
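The flip rates quoted above can be read as the fraction of comparisons whose outcome changes once a distractor feature is added to one response. The following is a minimal sketch of that measurement, not the authors' implementation: the `Trial` structure, the function names, the judge call signatures, and the tie handling are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Trial:
    """One comparison (hypothetical structure): B appears with and without a distractor."""
    prompt: str
    response_a: str
    response_b: str
    response_b_distracted: str


# Assumed judge interfaces; a real judge would wrap an LLM call.
ScoreFn = Callable[[str, str], float]      # (prompt, response) -> absolute score
PreferFn = Callable[[str, str, str], str]  # (prompt, a, b) -> "A" or "B"


def preference_from_scores(score_a: float, score_b: float) -> str:
    """Turn two absolute scores into an implied preference (ties kept as 'tie')."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def absolute_flip_rate(trials: Sequence[Trial], score: ScoreFn) -> float:
    """Fraction of trials where the score-implied preference changes after the distractor."""
    if not trials:
        raise ValueError("need at least one trial")
    flips = 0
    for t in trials:
        score_a = score(t.prompt, t.response_a)
        before = preference_from_scores(score_a, score(t.prompt, t.response_b))
        after = preference_from_scores(score_a, score(t.prompt, t.response_b_distracted))
        flips += before != after
    return flips / len(trials)


def pairwise_flip_rate(trials: Sequence[Trial], prefer: PreferFn) -> float:
    """Fraction of trials where the judge's direct pairwise choice changes after the distractor."""
    if not trials:
        raise ValueError("need at least one trial")
    flips = 0
    for t in trials:
        before = prefer(t.prompt, t.response_a, t.response_b)
        after = prefer(t.prompt, t.response_a, t.response_b_distracted)
        flips += before != after
    return flips / len(trials)
```

Running both functions over the same trials with the same underlying judge gives two directly comparable numbers, which is the kind of comparison behind the 35% and 9% figures. The paper's exact flip definition and its treatment of ties may differ from this sketch.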

Problem

Research questions and friction points this paper is trying to address.

Evaluating feedback protocols for bias in LLM-based evaluation

Assessing impact of feedback protocol choice on evaluation reliability

Analyzing robustness of absolute vs pairwise scoring to manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Shows that pairwise evaluation protocols are more vulnerable to distracted evaluation

Shows that absolute scoring is more robust to manipulation via distractor features

Demonstrates that the choice of feedback protocol significantly affects evaluation reliability

🔎 Similar Papers

The Comparative Trap: Pairwise Comparisons Amplifies Biased Preferences of LLM Evaluators

2024-06-18, Citations: 1

Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators

2024-03-25, arXiv.org, Citations: 32

