Efficient Bayesian Inference from Noisy Pairwise Comparisons

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
In generative model evaluation, human pairwise comparisons exhibit higher consistency than single-item ratings; however, existing Bradley–Terry–based methods often neglect annotator quality heterogeneity and lack theoretical convergence guarantees, undermining robustness and interpretability. To address this, we propose BBQ (Bayesian Bradley–Terry with Quality-aware Raters), the first Bayesian Bradley–Terry framework that explicitly models annotator quality. We establish a theoretical guarantee of monotonic likelihood convergence. BBQ employs an EM algorithm for rater-aware Bayesian inference, adaptively weighting noisy annotations based on inferred annotator reliability. Experiments demonstrate that BBQ achieves faster convergence and superior uncertainty calibration. Crucially, it maintains stable and reliable ranking and scoring under high noise and crowdsourced settings. By jointly inferring item utilities and annotator quality, BBQ significantly enhances the robustness and interpretability of human preference modeling in generative evaluation.

📝 Abstract
Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
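The abstract builds on the standard Bradley-Terry model, where each item has a latent utility and the probability that one item beats another is a logistic function of the utility gap. As background, here is a minimal sketch of that baseline (not the paper's BBQ method): the win probability and a single gradient-ascent update on the log-likelihood after observing one comparison. The function names and the learning rate are illustrative choices, not from the paper.

```python
import math

def bt_win_prob(theta_i: float, theta_j: float) -> float:
    """Bradley-Terry: P(item i beats item j) = sigmoid(theta_i - theta_j)."""
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))

def bt_update(theta: list, i: int, j: int, lr: float = 0.1) -> None:
    """One gradient-ascent step on the BT log-likelihood after
    observing 'i beat j' (standard update, not the paper's algorithm)."""
    p = bt_win_prob(theta[i], theta[j])
    theta[i] += lr * (1.0 - p)   # raise the winner's utility
    theta[j] -= lr * (1.0 - p)   # lower the loser's utility
```

With equal utilities the model predicts an even match (`bt_win_prob(0.0, 0.0) == 0.5`), and repeated wins for one item push its utility above the other's. The baseline treats every comparison as equally trustworthy, which is exactly the assumption BBQ relaxes.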
Problem

Research questions and friction points this paper is trying to address.

Evaluating generative models with noisy human pairwise comparisons
Aggregating inconsistent ratings from unreliable human evaluators
Improving Bradley-Terry models with rater quality and convergence guarantees
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian Bradley-Terry model with rater quality
Expectation-Maximization ensures monotonic likelihood convergence
Downweights unreliable raters for robust rankings
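The three bullets above can be illustrated with a small EM sketch. This is an assumed formulation, not the paper's exact model: each rater r reports the true Bradley-Terry outcome with probability q[r] and flips it otherwise; the E-step computes the posterior that each annotation is honest, and the M-step re-estimates rater reliabilities and takes a gradient step on item utilities (a generalized-EM variant, so the paper's monotonicity proof does not carry over to this sketch). All names and hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def em_bt_rater_quality(comps, n_items, n_raters, n_iters=50, lr=0.1):
    """EM sketch for Bradley-Terry with per-rater reliability.

    comps: list of (i, j, r) tuples meaning rater r reported item i beat item j.
    Assumed noise model: rater r reports the true outcome with prob q[r],
    else the flipped outcome. Returns (item utilities, rater reliabilities).
    """
    theta = np.zeros(n_items)      # latent item utilities
    q = np.full(n_raters, 0.8)     # initial rater reliabilities
    i_idx = np.array([c[0] for c in comps])
    j_idx = np.array([c[1] for c in comps])
    r_idx = np.array([c[2] for c in comps])
    for _ in range(n_iters):
        # E-step: posterior probability each reported win is honest
        p = sigmoid(theta[i_idx] - theta[j_idx])
        w = q[r_idx] * p / (q[r_idx] * p + (1 - q[r_idx]) * (1 - p))
        # M-step (part 1): reliability = mean honesty posterior per rater
        for r in range(n_raters):
            mask = r_idx == r
            if mask.any():
                q[r] = w[mask].mean()
        # M-step (part 2): gradient step on the expected log-likelihood;
        # annotations judged dishonest pull the utilities the opposite way
        g = w * (1 - p) - (1 - w) * p
        grad = np.zeros(n_items)
        np.add.at(grad, i_idx, g)
        np.add.at(grad, j_idx, -g)
        theta += lr * grad
        theta -= theta.mean()      # fix the additive identifiability of BT
    return theta, q
```

The honesty posterior `w` is what "downweights unreliable raters": comparisons from a rater with low inferred q contribute little (or negatively) to the utility gradient, so consistent raters dominate the ranking.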