🤖 AI Summary
This work addresses key limitations of the comparative LLM-as-a-judge framework: weak uncertainty modeling in pairwise comparisons, low ranking reliability, and high comparison cost. We propose a generalized probabilistic modeling paradigm for comparative assessment, of which existing Product-of-Experts methods are special cases. Our core contributions are: (1) the "probability of reordering", a principled uncertainty metric for individual comparisons; (2) hybrid scoring that fuses absolute scores with relative pairwise comparisons under the Product-of-Experts framework; and (3) ranking-level uncertainty estimates that support confidence estimation and detection of low-quality predictions. Experiments demonstrate that the proposed uncertainty estimates reduce the number of required pairwise comparisons by ~50%, improving both ranking efficiency and robustness. While the choice of expert model has minimal impact on final rankings, it notably affects the quality of the overall uncertainty estimates.
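The score-fusion idea can be illustrated with a minimal Gaussian Product-of-Experts sketch: each pairwise judgement is treated as a Gaussian expert on a score difference, and the product of experts reduces to a least-squares problem for the latent scores. This is an illustrative reconstruction under assumed Gaussian experts with a logit link; `poe_scores` and its details are choices made here, not the paper's exact formulation.

```python
import numpy as np

def poe_scores(P, eps=1e-6):
    """MAP scores under a Gaussian Product-of-Experts over score differences.

    P[i, j] is the judge's probability that candidate i beats candidate j.
    Each comparison acts as a Gaussian expert centred at logit(P[i, j]) for
    the difference s_i - s_j; multiplying the experts yields a least-squares
    system for the latent scores s (illustrative sketch, not the paper's
    exact model).
    """
    P = np.clip(P, eps, 1 - eps)
    n = P.shape[0]
    d = np.log(P / (1 - P))            # logit targets for s_i - s_j
    rows, targets = [], []
    for i in range(n):
        for j in range(n):
            if i != j:
                r = np.zeros(n)
                r[i], r[j] = 1.0, -1.0
                rows.append(r)
                targets.append(d[i, j])
    rows.append(np.ones(n))            # anchor: scores are shift-invariant
    targets.append(0.0)
    s, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return s
```

With a full comparison matrix generated from true scores the ranking is recovered exactly; with only a sparse subset of comparisons the same least-squares system still applies, which is what makes selecting comparisons adaptively worthwhile.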
📝 Abstract
This paper explores generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. We show that existing Product-of-Experts methods are specific cases of a broader framework, enabling diverse modelling options. Furthermore, we propose improved uncertainty estimates for individual comparisons, enabling more efficient selection and achieving strong performance with fewer evaluations. We also introduce a method for estimating overall ranking uncertainty. Finally, we demonstrate that combining absolute and comparative scoring improves performance. Experiments show that the specific expert model has a limited impact on final rankings, but our proposed uncertainty estimates, especially the probability of reordering, significantly improve system efficiency, reducing the number of needed comparisons by ~50%. Furthermore, ranking-level uncertainty metrics can be used to identify low-performing predictions, where the nature of the probabilistic model has a notable impact on the quality of the overall uncertainty.
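The probability of reordering for a ranked pair can be sketched as follows, assuming the two score estimates are independent Gaussians with known variances (a simplifying assumption made here for illustration; the paper derives its uncertainties from the chosen probabilistic model):

```python
from math import erf, sqrt

def reordering_prob(s_i, s_j, var_i, var_j):
    """Probability that two Gaussian score estimates swap order.

    Under the independence assumption, the difference s_i - s_j is Gaussian
    with mean (s_i - s_j) and variance var_i + var_j; the probability of a
    reordering is the tail mass of that Gaussian on the other side of zero.
    """
    gap = abs(s_i - s_j)
    return 0.5 * (1.0 - erf(gap / sqrt(2.0 * (var_i + var_j))))
```

Pairs with a high reordering probability are the ones worth spending further comparisons on; prioritising them is the kind of selection that drives the ~50% reduction in comparisons reported above.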