🤖 AI Summary
Current large language models (LLMs) exhibit systematic biases when employed as automatic evaluators, leading to weak agreement with human judgments. This work reframes text quality assessment as a ranking problem and proposes PairS, an uncertainty-guided pairwise preference search method. Inspired by the use of preference data in RLHF, the method replaces direct scoring with LLM pairwise comparisons, using uncertainty to guide the search and rank candidate texts efficiently. Key contributions include: (i) a systematic study showing that existing calibration techniques are insufficient to align LLM evaluators with human judgment; and (ii) evidence that pairwise preferences help quantify the transitivity of LLMs and that PairS benefits from calibration. Across representative evaluation tasks, PairS achieves state-of-the-art performance and substantially outperforms direct scoring, indicating that pairwise comparison, rather than absolute scoring, is key to improving the consistency and reliability of LLM-based evaluation.
📝 Abstract
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. In this work, we first conduct a systematic study of the misalignment between LLM evaluators and human judgement, revealing that existing calibration methods aimed at mitigating biases are insufficient for effectively aligning LLM evaluators. Inspired by the use of preference data in RLHF, we formulate the evaluation as a ranking problem and introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts. PairS achieves state-of-the-art performance on representative evaluation tasks and demonstrates significant improvements over direct scoring. Furthermore, we provide insights into the role of pairwise preference in quantifying the transitivity of LLMs and demonstrate how PairS benefits from calibration.
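The abstract describes PairS only at a high level: rank candidates by asking an LLM for pairwise preferences rather than absolute scores. A minimal sketch of the pairwise-ranking idea, assuming a merge-sort-style search over an LLM comparator, is shown below. This is not the authors' PairS implementation (their method additionally uses preference uncertainty to prune comparisons); `llm_prefers` is a hypothetical stand-in for an LLM judgment call, faked here with text length so the sketch runs.

```python
def llm_prefers(a: str, b: str) -> bool:
    """Hypothetical stub for an LLM pairwise judgment.

    Returns True if candidate `a` is preferred over `b`. A real system
    would prompt an LLM here; we fake the preference with text length
    so the sketch is self-contained and runnable.
    """
    return len(a) >= len(b)

def pairwise_rank(candidates: list[str]) -> list[str]:
    """Rank candidates best-first with O(n log n) pairwise comparisons.

    Merge sort is used because it needs only pairwise preferences,
    never absolute scores, matching the abstract's framing of
    evaluation as a ranking problem.
    """
    if len(candidates) <= 1:
        return list(candidates)
    mid = len(candidates) // 2
    left = pairwise_rank(candidates[:mid])
    right = pairwise_rank(candidates[mid:])
    merged = []
    while left and right:
        # One LLM comparison decides which head element ranks higher.
        if llm_prefers(left[0], right[0]):
            merged.append(left.pop(0))
        else:
            merged.append(right.pop(0))
    return merged + left + right

print(pairwise_rank(["bb", "a", "cccc"]))  # → ['cccc', 'bb', 'a']
```

With a real LLM comparator, each `llm_prefers` call would be a prompt of the form "Which of these two texts is better?"; the merge-sort structure keeps the number of such calls to O(n log n) instead of the O(n²) needed for all pairs.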