Generalised Probabilistic Modelling and Improved Uncertainty Estimation in Comparative LLM-as-a-judge

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key limitations of the comparative LLM-as-a-judge framework: weak uncertainty modelling for pairwise comparisons, limited ranking reliability, and high comparison cost. It presents a generalised probabilistic modelling paradigm for comparative assessment, showing that existing Product-of-Experts methods are special cases of a broader family. The core contributions are: (1) the probability of reordering as a principled per-comparison uncertainty metric; (2) hybrid scoring that fuses absolute scores with relative pairwise comparisons within the Product-of-Experts framework; and (3) ranking-level uncertainty estimates supporting confidence estimation and detection of low-quality predictions. Experiments show the improved uncertainty estimates cut the number of pairwise comparisons needed by ~50%, improving both ranking efficiency and robustness. The choice of expert model has limited impact on final rankings but notably affects the quality of the uncertainty estimates.

📝 Abstract
This paper explores generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. We show that existing Product-of-Experts methods are specific cases of a broader framework, enabling diverse modelling options. Furthermore, we propose improved uncertainty estimates for individual comparisons, enabling more efficient selection and achieving strong performance with fewer evaluations. We also introduce a method for estimating overall ranking uncertainty. Finally, we demonstrate that combining absolute and comparative scoring improves performance. Experiments show that the specific expert model has a limited impact on final rankings, but our proposed uncertainty estimates, especially the probability of reordering, significantly improve the efficiency of such systems, reducing the number of needed comparisons by ~50%. Furthermore, ranking-level uncertainty metrics can be used to identify low-performing predictions, where the nature of the probabilistic model has a notable impact on the quality of the overall uncertainty.
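To make the abstract's two central ideas concrete, here is a minimal sketch (not the paper's implementation): fusing judged pairwise win probabilities into latent quality scores with a Bradley-Terry-style Product-of-Experts likelihood, and estimating the probability that two items swap rank under assumed Gaussian score noise. All function names and the noise model are illustrative assumptions.

```python
import math

def fit_scores(pairwise, n, steps=500, lr=0.1):
    """Fit latent quality scores by gradient ascent on a
    Bradley-Terry product-of-experts log-likelihood.

    pairwise: dict {(i, j): p} where p is the judge's soft
    probability that item i beats item j (an illustrative input
    format, not the paper's)."""
    scores = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for (i, j), p in pairwise.items():
            # model's P(i beats j) = sigmoid of the score difference
            q = 1.0 / (1.0 + math.exp(-(scores[i] - scores[j])))
            grad[i] += p - q
            grad[j] -= p - q
        scores = [s + lr * g for s, g in zip(scores, grad)]
    return scores

def reorder_probability(si, sj, sigma=1.0):
    """P(item i falls below item j) under independent Gaussian noise
    N(0, sigma^2) on each score; the difference has std sigma*sqrt(2),
    giving P = 0.5 * erfc((si - sj) / (2 * sigma))."""
    return 0.5 * math.erfc((si - sj) / (2.0 * sigma))
```

A low `reorder_probability` for adjacent items means that pair is already reliably ordered, so further comparisons can be spent elsewhere — the intuition behind the ~50% reduction in comparisons reported above.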
Problem

Research questions and friction points this paper is trying to address.

How to generalise probabilistic modelling in comparative LLM-as-a-judge frameworks
How to improve uncertainty estimation for individual pairwise comparisons
Whether combining absolute and comparative scoring improves performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

A generalised probabilistic modelling framework subsuming existing Product-of-Experts methods
Improved per-comparison uncertainty estimates (probability of reordering) for efficient comparison selection
Combined absolute and comparative scoring for stronger performance
Yassir Fathullah
Google DeepMind
Sequence Uncertainty · Multi-Modal LLMs · Efficiency
Mark J. F. Gales
Engineering Department, University of Cambridge, UK