Are We on the Right Way to Assessing LLM-as-a-Judge?

📅 2025-12-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current LLM-as-a-Judge evaluation relies heavily on human annotations, whose biases undermine reliability and limit scalability. Method: We propose Sage, a fully unsupervised, axiom-driven evaluation framework grounded in rational choice theory that introduces two metrics, local self-consistency and global logical consistency, eliminating dependence on human labels. The approach incorporates preference stability analysis, logical transitivity verification, hybrid dataset construction (combining structured benchmark problems with real-world user queries), and panel-based and deep-reasoning adjudication mechanisms. Contribution/Results: We identify and empirically validate the “situational preference” phenomenon, revealing substantial inconsistency in human annotations and challenging their status as a gold standard. Experiments show that Sage’s metrics are stable and highly correlated with LLMBar and RewardBench2. Even GPT-5 and Gemini-2.5-Pro exhibit preference inconsistency on roughly a quarter of challenging tasks. Fine-tuning and multi-judge ensembling significantly improve consistency.
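As a concrete illustration of the local self-consistency metric (pairwise preference stability), the sketch below queries a judge twice with the answer order swapped and checks whether the preferred answer survives the swap. This is a minimal sketch assuming a hypothetical `Judge` callable that returns "A" or "B"; it is not the paper's actual interface.

```python
from typing import Callable

# Hypothetical judge interface: (question, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]

def is_locally_consistent(judge: Judge, question: str,
                          ans_x: str, ans_y: str) -> bool:
    """A judge is locally self-consistent on a pair when its preferred
    answer does not change after the presentation order is swapped."""
    first = judge(question, ans_x, ans_y)    # ans_x shown in slot A
    second = judge(question, ans_y, ans_x)   # ans_x shown in slot B
    # Map both verdicts back to the underlying answers before comparing.
    winner_first = ans_x if first == "A" else ans_y
    winner_second = ans_y if second == "A" else ans_x
    return winner_first == winner_second
```

Averaging this check over many answer pairs yields one plausible local-consistency score; Sage's exact formulation may differ.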

📝 Abstract
LLM-as-a-Judge has been widely adopted as an evaluation method and as a source of supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge mainly rely on human-annotated ground truth, which introduces human bias that undermines reliability assessment and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without requiring any human annotation. Inspired by the axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pairwise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks such as LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon we call situational preference, which explains why explicit rubrics or criteria can help a model judge consistently across answer pairs. Further analysis shows that fine-tuning an LLM-as-a-Judge is a feasible way to boost performance, and that panel-based judging and deep reasoning can enhance judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
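To make the global logical consistency lens concrete, here is a minimal sketch that checks transitivity over a full set of pairwise preferences by counting preference cycles (a beats b, b beats c, yet c beats a). The dictionary-of-winners representation is an assumption for illustration, not Sage's actual data format.

```python
from itertools import combinations

def count_preference_cycles(prefers: dict[tuple[str, str], str],
                            items: list[str]) -> int:
    """Count unordered triples {a, b, c} whose three pairwise verdicts
    form a cycle, i.e. violate transitivity. `prefers` maps an ordered
    pair of answers to the judge's winner; either key orientation works."""
    def winner(x: str, y: str) -> str:
        # Look up the stored verdict regardless of key orientation.
        return prefers.get((x, y)) or prefers[(y, x)]

    cycles = 0
    for a, b, c in combinations(items, 3):
        wins = {winner(a, b), winner(b, c), winner(a, c)}
        # Transitive triple: one item wins both of its comparisons, so
        # only two distinct winners appear. Cycle: all three differ.
        if len(wins) == 3:
            cycles += 1
    return cycles
```

Over a round-robin of all candidate answers to a question, the fraction of non-cycling triples is one natural global-consistency score; the paper's metric definitions may differ.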
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLM judges without human annotation to avoid bias.
Measures judge consistency through preference stability and transitivity.
Reveals significant reliability issues in state-of-the-art LLM judges.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sage evaluation suite eliminates human annotation bias.
Measures LLM judges via local and global consistency axioms.
Builds its dataset from structured benchmarks and real-world user queries.
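Both the summary and the abstract note that panel-based judging (multi-judge ensembling) improves consistency. As a rough sketch under the same hypothetical judge interface as above, majority-vote adjudication across a panel might look like this; a real system might abstain or re-query on ties rather than use the deterministic tie-break shown here.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface: (question, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]

def panel_verdict(judges: Sequence[Judge], question: str,
                  answer_a: str, answer_b: str) -> str:
    """Aggregate pairwise verdicts from a panel of judges by majority
    vote. Ties break toward "A" purely for determinism."""
    votes = Counter(judge(question, answer_a, answer_b) for judge in judges)
    return "A" if votes["A"] >= votes["B"] else "B"
```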
Yuanning Feng
Huazhong University of Science and Technology

Sinan Wang
Southern University of Science and Technology
Software Engineering · Software Testing · Software Analysis

Zhengxiang Cheng
Huazhong University of Science and Technology

Yao Wan
Huazhong University of Science and Technology
NLP · Programming Languages · Software Engineering · Large Language Models

Dongping Chen
University of Maryland