Judging LLMs on a Simplex

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) used as automatic evaluators suffer from ranking unidentifiability: binary scoring preserves rank identifiability under mild assumptions, whereas scoring with three or more levels is inherently non-identifiable even with infinite data. Method: We propose a geometric framework grounded in the probability simplex, uncovering a phase-transition phenomenon in LLM-based evaluation. Our approach jointly models aleatoric and epistemic uncertainty, encodes domain knowledge as interpretable Bayesian priors, and conducts systematic sensitivity analysis of ranking estimates. Contribution/Results: Across multiple benchmarks, the method improves ranking accuracy and brings credible-interval coverage closer to nominal levels, demonstrating the effectiveness and robustness of holistic uncertainty modeling in LLM evaluation and providing both theoretical insight into identifiability limits and practical advances in reliable automated assessment.

📝 Abstract
Automated evaluation of free-form outputs from large language models (LLMs) is challenging because many distinct answers can be equally valid. A common practice is to use LLMs themselves as judges, but the theoretical properties of this approach are not yet well understood. We show that a geometric framework that represents both judges and candidates as points on a probability simplex can provide helpful insight on what is or is not identifiable using LLM judges. Our theoretical analysis uncovers a "phase transition" in ranking identifiability: for binary scoring systems, true rankings are identifiable even with weak judges under mild assumptions, while rankings become non-identifiable for three or more scoring levels even with infinite data, absent additional prior knowledge. This non-identifiability highlights how uncertainty in rankings stems from not only aleatoric uncertainty (i.e., inherent stochasticity in the data) but also epistemic uncertainty regarding which assumptions hold, an aspect that has received limited attention until now. To integrate both types of uncertainty, we use Bayesian inference to encode assumptions as priors and conduct sensitivity analysis of ranking estimates and credible intervals. Empirical evaluations across multiple benchmarks demonstrate that Bayesian inference yields more accurate rankings and substantially improves coverage rates. These results underscore the importance of taking a more holistic approach to uncertainty quantification when using LLMs as judges.
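As a hedged sketch of the Bayesian machinery the abstract describes (the candidate count, pass rates, and trial sizes below are invented for illustration, not taken from the paper): with binary judge verdicts, a conjugate Beta prior on each candidate's pass rate yields posterior ranking probabilities and credible intervals by simple Monte Carlo sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three candidate models receive binary pass/fail
# verdicts from an LLM judge; the true pass rates below are unknown
# to the inference procedure.
true_rates = [0.55, 0.60, 0.70]
n_trials = 200
verdicts = [rng.binomial(1, p, size=n_trials) for p in true_rates]

# A Beta(1, 1) prior encodes no prior preference between candidates;
# the posterior for candidate i is Beta(1 + successes, 1 + failures).
posteriors = [(1 + v.sum(), 1 + n_trials - v.sum()) for v in verdicts]

# Sample the joint posterior to obtain a distribution over rankings,
# then report each candidate's probability of being ranked best.
samples = np.stack([rng.beta(a, b, size=10_000) for a, b in posteriors])
p_best = (samples.argmax(axis=0)[:, None] == np.arange(3)).mean(axis=0)

# 95% credible intervals for each candidate's pass rate.
intervals = [np.percentile(s, [2.5, 97.5]) for s in samples]
for i, (lo, hi) in enumerate(intervals):
    print(f"candidate {i}: P(best)={p_best[i]:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Sensitivity analysis in the paper's spirit would repeat this under different priors and check how much the ranking probabilities move.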
Problem

Research questions and friction points this paper is trying to address.

Challenges in automated evaluation of diverse LLM outputs
Theoretical limits of ranking identifiability with LLM judges
Integrating epistemic and aleatoric uncertainty in Bayesian evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometric framework on probability simplex
Bayesian inference for uncertainty quantification
Phase transition in ranking identifiability
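The phase-transition bullet above can be made concrete with a toy example (the score distributions and level valuations below are invented for illustration, not the paper's construction): with three score levels, two candidates represented as points on the probability simplex can rank differently depending on how the levels are valued, so observed score frequencies alone cannot pin down a unique ranking; with only two levels, any monotone valuation gives the same ranking.

```python
import numpy as np

# Each candidate is a point on the 3-simplex: probabilities of
# receiving score levels 0, 1, and 2 from the judge.
cand_a = np.array([0.2, 0.6, 0.2])  # concentrated on the middle level
cand_b = np.array([0.4, 0.1, 0.5])  # polarized between extremes

# Two plausible numeric valuations of the three levels flip the ranking.
for values in ([0.0, 1.0, 2.0], [0.0, 1.5, 2.0]):
    score_a, score_b = cand_a @ values, cand_b @ values
    winner = "A" if score_a > score_b else "B"
    print(f"valuation {values}: {winner} ranks higher "
          f"({score_a:.2f} vs {score_b:.2f})")
```

Under the uniform valuation candidate B wins, but weighting the middle level more heavily makes candidate A win, illustrating why additional prior knowledge is needed once there are three or more scoring levels.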