Handling Missing Responses under Cluster Dependence with Applications to Language Model Evaluation

📅 2025-10-23

🤖 AI Summary

This paper addresses two pervasive challenges in human evaluation of generative AI: nonresponse (i.e., missing annotations) and cluster dependence (e.g., multiple questions from the same user producing correlated responses). Rather than proposing a new estimator, it analyzes the doubly robust estimator, a method widely used in missing data analysis and causal inference, in this clustered setting. Theoretically, it establishes new properties of the estimator under cluster dependence, including consistency and asymptotic normality. The aim is unbiased estimation of average annotation scores together with appropriately calibrated uncertainty quantification. Simulations and a real-world conversation quality dataset indicate that ignoring clustering leads to underestimated standard errors and invalid statistical inference, whereas accounting for it improves confidence interval coverage. The results underscore that cluster dependence must be incorporated in missing response problems for inference to be valid.

📝 Abstract

Human annotations play a crucial role in evaluating the performance of GenAI models. Two common challenges in practice, however, are missing annotations (the response variable of interest) and cluster dependence among human-AI interactions (e.g., questions asked by the same user may be highly correlated). Reliable inference must address both these issues to achieve unbiased estimation and appropriately quantify uncertainty when estimating average scores from human annotations. In this paper, we analyze the doubly robust estimator, a widely used method in missing data analysis and causal inference, applied to this setting and establish novel theoretical properties under cluster dependence. We further illustrate our findings through simulations and a real-world conversation quality dataset. Our theoretical and empirical results underscore the importance of incorporating cluster dependence in missing response problems to perform valid statistical inference.
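
For orientation, a common textbook form of the doubly robust (augmented inverse probability weighted) estimator of a mean is sketched below. The notation is an assumption made for illustration and is not taken from the paper: clusters are indexed by g = 1, ..., G, units within a cluster by i = 1, ..., n_g, R is the indicator that the annotation Y was observed, X are covariates, and the fitted outcome and response models are written as \hat m and \hat\pi. The second line is the usual cluster-robust variance estimate, which sums the estimated influence values within each cluster before squaring.

```latex
\hat\mu_{\mathrm{DR}}
  = \frac{1}{n}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\hat\phi_{gi},
\qquad
\hat\phi_{gi}
  = \hat m(X_{gi}) + \frac{R_{gi}}{\hat\pi(X_{gi})}\bigl(Y_{gi}-\hat m(X_{gi})\bigr),
\qquad
n=\sum_{g=1}^{G} n_g

\widehat{\mathrm{Var}}_{\mathrm{cluster}}\bigl(\hat\mu_{\mathrm{DR}}\bigr)
  = \frac{1}{n^{2}}\sum_{g=1}^{G}\Bigl(\sum_{i=1}^{n_g}\bigl(\hat\phi_{gi}-\hat\mu_{\mathrm{DR}}\bigr)\Bigr)^{2}
```

When the influence values within a cluster are positively correlated, the cross terms inside the square are positive on average, so treating all n units as independent would tend to understate the variance.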

Problem

Research questions and friction points this paper is trying to address.

Addressing missing annotations in human evaluation data

Handling cluster dependence in human-AI interaction correlations

Achieving unbiased estimation with proper uncertainty quantification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Doubly robust estimator addresses missing annotations

Novel theoretical properties under cluster dependence

Incorporating cluster dependence for valid inference
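
The points above can be illustrated with a minimal sketch, assuming a standard AIPW form of the doubly robust estimator and a sandwich-style cluster-robust standard error. This is not the paper's implementation: the function name `dr_mean_with_cluster_se`, the simulated data, the linear outcome model, and the assumption that the response probabilities are known are all hypothetical choices made for the example.

```python
import numpy as np


def dr_mean_with_cluster_se(y, r, m_hat, pi_hat, cluster):
    """AIPW estimate of E[Y] with naive and cluster-robust standard errors.

    y: outcomes (may be NaN where missing), r: 1 if observed else 0,
    m_hat: outcome-model predictions, pi_hat: response probabilities,
    cluster: cluster label for each unit.
    """
    r = np.asarray(r, dtype=float)
    # Missing outcomes are zeroed out; they are multiplied by r = 0 anyway.
    y = np.where(r == 1, y, 0.0)
    phi = m_hat + r * (y - m_hat) / pi_hat  # estimated influence values
    est = phi.mean()
    n = phi.size
    resid = phi - est
    # Naive SE treats every unit as independent.
    naive_se = np.sqrt(np.sum(resid ** 2)) / n
    # Cluster-robust SE sums residuals within each cluster before squaring.
    _, idx = np.unique(cluster, return_inverse=True)
    cluster_sums = np.bincount(idx.ravel(), weights=resid)
    cluster_se = np.sqrt(np.sum(cluster_sums ** 2)) / n
    return est, naive_se, cluster_se


# Hypothetical simulation: 200 users (clusters), 10 annotated turns each.
rng = np.random.default_rng(0)
G, m = 200, 10
cluster = np.repeat(np.arange(G), m)
x = rng.normal(size=G * m)
u = np.repeat(rng.normal(size=G), m)           # shared per-user effect
y_full = 3.0 + x + u + rng.normal(size=G * m)  # true mean score is 3.0
pi = 1.0 / (1.0 + np.exp(-(0.5 + x)))          # response probability (assumed known)
r = (rng.uniform(size=G * m) < pi).astype(float)
y = np.where(r == 1, y_full, np.nan)

# Outcome model: ordinary least squares fitted on observed rows only.
X = np.column_stack([np.ones(G * m), x])
obs = r == 1
beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
m_hat = X @ beta

est, naive_se, cluster_se = dr_mean_with_cluster_se(y, r, m_hat, pi, cluster)
print(f"estimate={est:.3f} naive_se={naive_se:.3f} cluster_se={cluster_se:.3f}")
```

In this simulated setup the shared per-user effect induces positive within-cluster correlation, so the cluster-robust standard error should come out larger than the naive one. In practice the response probabilities would usually have to be estimated as well, for example with a logistic regression.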
