🤖 AI Summary
LLM-as-a-judge exhibits systematic biases relative to human judgments, undermining its reliability for trustworthy evaluation. To address this, we propose Bridge, a unified statistical framework that jointly models latent human preference scores and the LLM's deviation mechanism, enabling alignment under both absolute scoring and pairwise comparison paradigms. Bridge combines latent-variable modeling with linear covariate regression, yielding interpretable bias attribution and calibrated score estimation, together with rigorous statistical inference guarantees grounded in asymptotic theory. Evaluated across six prominent LLM evaluators and two benchmarks (BigGen Bench and Chatbot Arena), Bridge significantly improves agreement with human judgments: average accuracy increases by 4.2%, calibration error decreases by 31%, and KL divergence is reduced by 28%.
📄 Abstract
Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancy. This yields a simple, principled way to refine LLM ratings and to characterize systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy, calibration, and KL divergence) and exposes systematic human-LLM gaps.
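To make the modeling idea concrete, here is a minimal sketch of the absolute-scoring case under an assumed additive form: each pair has a latent human score, and the LLM judge reports that score plus a linear function of observed covariates. The simulated data, covariate names, and the plain least-squares fit are illustrative assumptions, not the paper's actual estimator or inference procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: pair i has latent human preference score z_i; the
# LLM judge reports z_i plus a linear bias in covariates x_i (e.g.
# response length, verbosity). All values below are simulated.
n, d = 500, 3
X = rng.normal(size=(n, d))               # covariates for each pair
z = rng.normal(size=n)                    # latent human preference scores
beta_true = np.array([0.8, -0.5, 0.3])    # assumed deviation coefficients

human = z + 0.1 * rng.normal(size=n)                 # noisy human ratings
llm = z + X @ beta_true + 0.1 * rng.normal(size=n)   # biased LLM ratings

# Estimate the deviation coefficients by ordinary least squares on the
# LLM-human gap, then subtract the fitted bias to calibrate the judge.
beta_hat, *_ = np.linalg.lstsq(X, llm - human, rcond=None)
llm_calibrated = llm - X @ beta_hat
```

Here `beta_hat` plays the role of the interpretable bias attribution (which covariates drive the human-LLM gap, and in which direction), while `llm_calibrated` corresponds to the refined LLM ratings; after calibration the mean squared gap to the human scores shrinks substantially in this simulation.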