🤖 AI Summary
Generative AI evaluation suffers from external validity challenges: the demographics of human annotators and the distribution of system outputs in laboratory settings often deviate from real-world deployment conditions, leading to biased quality estimates. To address this, we propose a doubly robust evaluation framework that integrates large language model (LLM)-simulated, diverse annotator personas with propensity score reweighting and outcome regression modeling to yield unbiased system quality estimates. Its double robustness property, which guarantees consistency if either the persona-based outcome model or the reweighting model is correctly specified, enhances reliability under distributional shift. We theoretically establish estimator consistency and empirically validate robustness across multiple bias configurations and persona fidelity levels via a persona simulation framework. Our key contribution is the first systematic integration of LLM-driven fine-grained population modeling with causal inference techniques to address generalizability limitations in GenAI evaluation.
📝 Abstract
As Generative AI (GenAI) systems see growing adoption, a key concern is the external validity of evaluations: the extent to which they generalize from lab-based to real-world deployment conditions. Threats to the external validity of GenAI evaluations arise when the source sample of human raters and system outputs used to obtain a system quality estimate differs from the target distribution at deployment time. In this work, we propose a doubly robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of "persona" ratings produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid system quality estimates when either (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias is of sufficient quality. We validate our framework theoretically and via a novel Persona Simulation Framework (PSF) designed to systematically manipulate persona quality and the degree of evaluation sampling bias present in source data. Our work provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.
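To make the estimator's structure concrete, here is a minimal sketch of a doubly robust (AIPW-style) quality estimate of the kind the abstract describes. The function name, argument names, and the use of NumPy are illustrative assumptions, not the paper's actual implementation; it simply combines an outcome model's predictions (e.g., fit using persona ratings) with importance-weighted residuals from the biased source sample:

```python
import numpy as np

def doubly_robust_quality_estimate(y_src, m_src, m_tgt, w_src):
    """Sketch of a doubly robust estimate of mean system quality
    on the target (deployment) population.

    y_src : human ratings observed in the biased source sample
    m_src : outcome-model predictions for the source sample
            (e.g., a model fit on persona ratings + source data)
    m_tgt : outcome-model predictions for a sample drawn from the
            target distribution
    w_src : importance weights p_target(x) / p_source(x) for the
            source sample, from a reweighting (propensity) model
    """
    # Plug-in term: average predicted rating over the target sample.
    plug_in = np.mean(m_tgt)
    # Correction term: reweighted residuals on the source sample.
    # This vanishes if the outcome model is exact, and debiases the
    # plug-in term if the weights are exact.
    correction = np.mean(w_src * (y_src - m_src))
    return plug_in + correction
```

The double robustness is visible in the two terms: if the outcome model is correct, the residuals `y_src - m_src` are centered and the plug-in term carries the estimate; if instead the weights are correct, the correction term repairs any bias in the outcome model, so the estimate remains valid when either component is well specified.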