🤖 AI Summary
Generative AI evaluation suffers from external validity challenges: the demographics of human annotators and the distribution of system outputs in laboratory settings often deviate from real-world deployment conditions, leading to biased quality estimates. To address this, we propose a doubly robust evaluation framework that integrates large language model (LLM)-simulated, diverse annotator personas with propensity score reweighting and outcome regression modeling to yield unbiased system quality estimates. Its double robustness property, which guarantees consistency if either the persona-based outcome model or the reweighting model is correctly specified, enhances reliability under distributional shift. We theoretically establish estimator consistency and empirically validate robustness across multiple bias configurations and persona fidelity levels via a persona simulation framework. Our key contribution is the first systematic integration of LLM-driven fine-grained population modeling with causal inference techniques to address generalizability limitations in GenAI evaluation.
📝 Abstract
As Generative AI (GenAI) systems see growing adoption, a key concern is the external validity of evaluations: the extent to which they generalize from lab-based to real-world deployment conditions. Threats to the external validity of GenAI evaluations arise when the source sample of human raters and system outputs used to obtain a system quality estimate differs from the target distribution at deployment time. In this work, we propose a doubly robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of "persona" ratings produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid system quality estimates when either (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias is of sufficient quality. We validate our framework theoretically and via a novel Persona Simulation Framework (PSF) designed to systematically manipulate persona quality and the degree of evaluation sampling bias present in source data. Our work provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.
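To make the estimator's structure concrete, here is a minimal sketch of a doubly robust (AIPW-style) quality estimate of the kind the abstract describes. The function name, argument names, and the use of NumPy are illustrative assumptions, not the paper's actual implementation; it simply combines an outcome model's predictions (e.g., fit using persona ratings) with importance-weighted residuals from the biased source sample:

```python
import numpy as np

def doubly_robust_quality_estimate(y_src, m_src, m_tgt, w_src):
    """Sketch of a doubly robust estimate of mean system quality
    on the target (deployment) population.

    y_src : human ratings observed in the biased source sample
    m_src : outcome-model predictions for the source sample
            (e.g., a model fit on persona ratings + source data)
    m_tgt : outcome-model predictions for a sample drawn from the
            target distribution
    w_src : importance weights p_target(x) / p_source(x) for the
            source sample, from a reweighting (propensity) model
    """
    # Plug-in term: average predicted rating over the target sample.
    plug_in = np.mean(m_tgt)
    # Correction term: reweighted residuals on the source sample.
    # This vanishes if the outcome model is exact, and debiases the
    # plug-in term if the weights are exact.
    correction = np.mean(w_src * (y_src - m_src))
    return plug_in + correction
```

The double robustness is visible in the two terms: if the outcome model is correct, the residuals `y_src - m_src` are centered and the plug-in term carries the estimate; if instead the weights are correct, the correction term repairs any bias in the outcome model, so the estimate remains valid when either component is well specified.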