TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

📅 2025-09-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies two critical inconsistencies in the LLM-as-a-judge automatic evaluation paradigm: (1) Score-Comparison Inconsistency, where lower-scoring responses win over higher-scoring ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, such as cyclic preferences (A > B > C > A) or equivalence contradictions. To address these, the authors propose TrustJudge, the first framework to integrate distribution-sensitive probabilistic score modeling with likelihood-aware preference aggregation. It combines continuous expected scores over discrete rating distributions, bidirectional preference probability estimation, and perplexity-based calibration, enabling training-free, high-consistency evaluation. With Llama-3.1-70B-Instruct as the judge, TrustJudge reduces Score-Comparison inconsistency by 8.43% and Pairwise Transitivity inconsistency by 10.82% while maintaining superior evaluation accuracy, significantly enhancing the reliability and trustworthiness of LLM-based judgment.

📝 Abstract
The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A>B>C>A) and equivalence contradictions (A=B=C≠A). We argue that these issues stem from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: (1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and (2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge's components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43% (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82% (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations. The codes can be found at https://github.com/TrustJudge/TrustJudge.
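The distribution-sensitive scoring the abstract describes can be sketched in a few lines. This is a minimal illustration only: it assumes access to the judge's probability for each discrete rating token, and the `rating_probs` dictionary interface and the renormalization step are hypothetical, not the paper's exact implementation.

```python
def expected_score(rating_probs):
    """Distribution-sensitive score: the expectation over discrete ratings.

    rating_probs maps each discrete rating (e.g. 1-5) to the judge's
    probability of emitting that rating token. Instead of collapsing
    the distribution to a single argmax rating, we keep the full
    distribution and return its mean, preserving information that
    discrete scoring discards.
    """
    total = sum(rating_probs.values())
    # Renormalize in case the extracted token probabilities do not sum to 1.
    return sum(r * p for r, p in rating_probs.items()) / total

# A judge split evenly between ratings 3 and 4 yields a continuous 3.5
# rather than an arbitrary discrete choice between the two.
probs = {1: 0.0, 2: 0.0, 3: 0.5, 4: 0.5, 5: 0.0}
print(expected_score(probs))  # 3.5
```

Under this view, two responses that would both round to a discrete 4 can still be distinguished by their expected scores, which is what reduces ties and score-comparison contradictions.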
Problem

Research questions and friction points this paper is trying to address.

LLM evaluators show critical inconsistencies in automated assessment frameworks
Score comparison and pairwise transitivity inconsistencies undermine evaluation reliability
Discrete rating systems cause information loss and ambiguous tie judgments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distribution-sensitive scoring using continuous rating expectations
Likelihood-aware aggregation resolving transitivity violations
Probabilistic framework overcoming information loss limitations
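The likelihood-aware aggregation listed above can be illustrated with a minimal sketch. The function name and the simple averaging over the two presentation orders are hypothetical simplifications of the paper's bidirectional preference probability estimation, shown only to convey the idea of cancelling position bias before deciding a winner.

```python
def aggregate_preference(p_a_first, p_a_second):
    """Likelihood-aware aggregation over both presentation orders.

    p_a_first:  judge's probability that A wins when shown as (A, B)
    p_a_second: judge's probability that A wins when shown as (B, A)
    Averaging the two orders cancels position bias; the resulting
    probability, not a hard verdict per order, decides the preference.
    """
    p_a = 0.5 * (p_a_first + p_a_second)
    if p_a > 0.5:
        return "A"
    if p_a < 0.5:
        return "B"
    return "tie"

# Order-dependent hard verdicts (A wins when shown first, ties when
# shown second) resolve into one consistent preference.
print(aggregate_preference(0.7, 0.6))  # A
print(aggregate_preference(0.6, 0.4))  # tie
```

Because each pairwise outcome is a calibrated probability rather than a hard label, downstream rankings built from these outcomes are far less prone to the circular chains (A>B>C>A) the paper targets.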
Yidong Wang
Peking University
Yunze Song
National University of Singapore
Tingyuan Zhu
Institute of Science Tokyo
Xuanwang Zhang
Nanjing University
Zhuohao Yu
Peking University
Hao Chen
Google DeepMind
Chiyu Song
Zhejiang University; Westlake University
Qiufeng Wang
Southeast University
Cunxiang Wang
Tsinghua University; ZhipuAI
Zhen Wu
Nanjing University
Xinyu Dai
Nanjing University
Yue Zhang
Westlake University
Wei Ye
Peking University
Shikun Zhang
Peking University