🤖 AI Summary
LLM-as-a-judge exhibits systematic biases relative to human judgments, undermining its reliability for trustworthy evaluation. To address this, we propose Bridge, a unified statistical framework that jointly models latent human preference scores and the LLM's deviation mechanism, enabling alignment under both absolute scoring and pairwise comparison paradigms. Bridge combines latent-variable modeling with linear covariate regression, yielding interpretable bias attribution and calibrated score estimation, together with rigorous statistical inference guarantees grounded in asymptotic theory. Evaluated across six prominent LLM evaluators and two benchmarks (BigGen Bench and Chatbot Arena), Bridge significantly improves agreement with human judgments: average accuracy increases by 4.2%, calibration error decreases by 31%, and KL divergence is reduced by 28%.
📄 Abstract
Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancy. This yields a simple, principled way to refine LLM ratings and to characterize systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy, calibration, and KL divergence) and exposes systematic human-LLM gaps.
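To make the modeling idea concrete, here is a minimal sketch of the absolute-scoring case under an assumed additive form: each pair has a latent human score, and the LLM judge reports that score plus a linear function of observed covariates. The simulated data, covariate names, and the plain least-squares fit are illustrative assumptions, not the paper's actual estimator or inference procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: pair i has latent human preference score z_i; the
# LLM judge reports z_i plus a linear bias in covariates x_i (e.g.
# response length, verbosity). All values below are simulated.
n, d = 500, 3
X = rng.normal(size=(n, d))               # covariates for each pair
z = rng.normal(size=n)                    # latent human preference scores
beta_true = np.array([0.8, -0.5, 0.3])    # assumed deviation coefficients

human = z + 0.1 * rng.normal(size=n)                 # noisy human ratings
llm = z + X @ beta_true + 0.1 * rng.normal(size=n)   # biased LLM ratings

# Estimate the deviation coefficients by ordinary least squares on the
# LLM-human gap, then subtract the fitted bias to calibrate the judge.
beta_hat, *_ = np.linalg.lstsq(X, llm - human, rcond=None)
llm_calibrated = llm - X @ beta_hat
```

Here `beta_hat` plays the role of the interpretable bias attribution (which covariates drive the human-LLM gap, and in which direction), while `llm_calibrated` corresponds to the refined LLM ratings; after calibration the mean squared gap to the human scores shrinks substantially in this simulation.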