🤖 AI Summary
This work addresses key limitations of the comparative LLM-as-a-judge framework: weak uncertainty modeling in pairwise comparisons, low ranking reliability, and high comparison cost. We propose a generalized probabilistic modeling paradigm for comparative assessment, of which existing Product-of-Experts methods are special cases. Our core contributions are: (1) the "probability of reordering", a principled uncertainty metric for individual comparisons; (2) hybrid scoring that fuses absolute scores with relative pairwise comparisons under the Product-of-Experts framework; and (3) ranking-level uncertainty estimates that support confidence estimation and detection of low-quality predictions. Experiments demonstrate that the proposed uncertainty estimates reduce the number of required pairwise comparisons by ~50%, improving both ranking efficiency and robustness. While the choice of expert model has minimal impact on final rankings, it notably affects the quality of the overall uncertainty estimates.
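The score-fusion idea can be illustrated with a minimal Gaussian Product-of-Experts sketch: each pairwise judgement is treated as a Gaussian expert on a score difference, and the product of experts reduces to a least-squares problem for the latent scores. This is an illustrative reconstruction under assumed Gaussian experts with a logit link; `poe_scores` and its details are choices made here, not the paper's exact formulation.

```python
import numpy as np

def poe_scores(P, eps=1e-6):
    """MAP scores under a Gaussian Product-of-Experts over score differences.

    P[i, j] is the judge's probability that candidate i beats candidate j.
    Each comparison acts as a Gaussian expert centred at logit(P[i, j]) for
    the difference s_i - s_j; multiplying the experts yields a least-squares
    system for the latent scores s (illustrative sketch, not the paper's
    exact model).
    """
    P = np.clip(P, eps, 1 - eps)
    n = P.shape[0]
    d = np.log(P / (1 - P))            # logit targets for s_i - s_j
    rows, targets = [], []
    for i in range(n):
        for j in range(n):
            if i != j:
                r = np.zeros(n)
                r[i], r[j] = 1.0, -1.0
                rows.append(r)
                targets.append(d[i, j])
    rows.append(np.ones(n))            # anchor: scores are shift-invariant
    targets.append(0.0)
    s, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return s
```

With a full comparison matrix generated from true scores the ranking is recovered exactly; with only a sparse subset of comparisons the same least-squares system still applies, which is what makes selecting comparisons adaptively worthwhile.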
📝 Abstract
This paper explores generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. We show that existing Product-of-Experts methods are specific cases of a broader framework, enabling diverse modelling options. Furthermore, we propose improved uncertainty estimates for individual comparisons, enabling more efficient selection and achieving strong performance with fewer evaluations. We also introduce a method for estimating overall ranking uncertainty. Finally, we demonstrate that combining absolute and comparative scoring improves performance. Experiments show that the specific expert model has a limited impact on final rankings, but our proposed uncertainty estimates, especially the probability of reordering, significantly improve system efficiency, reducing the number of needed comparisons by ~50%. Furthermore, ranking-level uncertainty metrics can be used to identify low-performing predictions, where the nature of the probabilistic model has a notable impact on the quality of the overall uncertainty.
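The probability of reordering for a ranked pair can be sketched as follows, assuming the two score estimates are independent Gaussians with known variances (a simplifying assumption made here for illustration; the paper derives its uncertainties from the chosen probabilistic model):

```python
from math import erf, sqrt

def reordering_prob(s_i, s_j, var_i, var_j):
    """Probability that two Gaussian score estimates swap order.

    Under the independence assumption, the difference s_i - s_j is Gaussian
    with mean (s_i - s_j) and variance var_i + var_j; the probability of a
    reordering is the tail mass of that Gaussian on the other side of zero.
    """
    gap = abs(s_i - s_j)
    return 0.5 * (1.0 - erf(gap / sqrt(2.0 * (var_i + var_j))))
```

Pairs with a high reordering probability are the ones worth spending further comparisons on; prioritising them is the kind of selection that drives the ~50% reduction in comparisons reported above.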