Bridging Human and LLM Judgments: Understanding and Narrowing the Gap

πŸ“… 2025-08-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
LLM-as-a-judge exhibits systematic biases relative to human judgments, undermining its reliability for trustworthy evaluation. To address this, the authors propose Bridge, a unified statistical framework that jointly models latent human preference scores and the LLM's deviation mechanism, enabling alignment under both absolute scoring and pairwise comparison paradigms. Bridge integrates latent-variable modeling with linear covariate regression, yielding interpretable bias attribution and calibrated score estimates, together with asymptotic guarantees for statistical inference. Evaluated with six prominent LLM judges on two benchmarks (BigGen Bench and Chatbot Arena), Bridge significantly improves agreement with human judgments: average accuracy increases by 4.2%, calibration error decreases by 31%, and KL divergence drops by 28%.

πŸ“ Abstract
Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancies. This offers a simple and principled framework for refining LLM ratings and characterizing systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy, calibration, and KL divergence) and exposes systematic human-LLM gaps.
Problem

Research questions and friction points this paper is trying to address.

Systematic divergence between human and LLM judgments
Lack of a unified framework for bridging the evaluation gap
Characterizing and correcting discrepancies under both absolute and pairwise scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified statistical framework bridging human-LLM evaluations
Models LLM deviations via linear transformations of covariates
Efficient fitting algorithm with asymptotic guarantees
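The core modeling idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes an observed LLM score equals a latent human preference score plus a linear bias in observable covariates (e.g. response length or style features), estimates the bias weights by least squares on a small human-labeled subset, and then debiases the remaining LLM scores. All variable names and the synthetic data are hypothetical.

```python
# Hypothetical sketch of the latent-score + linear-covariate-deviation idea:
#   y_llm = z + beta . x + noise,  where z is the latent human score.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3                           # prompt-response pairs, covariates
X = rng.normal(size=(n, d))             # covariates per pair (assumed observed)
z = rng.normal(size=n)                  # latent human preference scores
beta_true = np.array([0.5, -0.3, 0.2])  # systematic LLM bias weights (synthetic)
y_llm = z + X @ beta_true + 0.1 * rng.normal(size=n)

# Human ratings are available for a subset; there the deviation
# y_llm - z is linear in the covariates, so beta is a least-squares fit.
idx = rng.choice(n, size=50, replace=False)
beta_hat, *_ = np.linalg.lstsq(X[idx], y_llm[idx] - z[idx], rcond=None)

# Calibrated (debiased) estimate of the latent score for all pairs.
z_hat = y_llm - X @ beta_hat
print(np.round(beta_hat, 2))  # should be close to beta_true
```

The fitted weights double as interpretable bias attribution: each component of `beta_hat` quantifies how strongly that covariate pulls the LLM judge away from the human score.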
πŸ”Ž Similar Papers
No similar papers found.