🤖 AI Summary
This study addresses the bias that arises in standard estimators when user feedback is missing not at random (MNAR), a setting in which conventional approaches rely on strong parametric assumptions or hard-to-obtain auxiliary variables. The authors propose a partial identification framework built on the notion of a “weak shadow variable,” which relaxes the stringent completeness condition required by classical shadow variables. By encoding outputs from a pretrained large language model (LLM) as conditional independence constraints alongside the observed data structure, the method derives sharp identification bounds for the target parameter via linear programming. Empirical evaluations on simulated and semi-synthetic customer-service dialogues demonstrate a 75%–83% reduction in identification-interval width while maintaining valid coverage under MNAR. The approach achieves √n-consistent point estimation when point identification holds and retains a sub-√n convergence rate under set identification.
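The weak-shadow-variable idea can be illustrated with a minimal simulation. The setup below is entirely hypothetical (star ratings, a noisy LLM sentiment prediction, and an outcome-dependent response rule are all invented for illustration): the key point is that under the assumed conditional independence Z ⊥ R | Y, the conditional law of the prediction Z given the outcome Y is the same for respondents and non-respondents, so it can be estimated from respondents alone and reused for the missing stratum — no completeness condition on Z is needed.

```python
import numpy as np

# Hypothetical data: star ratings y in {1..5}; z is an LLM-predicted
# sentiment in {0,1,2}, available for every user (it comes from the
# dialogue text); y is observed only when r is True.
rng = np.random.default_rng(0)
n = 10_000
y = rng.integers(1, 6, size=n)
# Noisy prediction: a coarsening of y plus independent noise, so
# Z ⊥ R | Y holds by construction.
z = np.clip((y - 1) // 2 + rng.integers(-1, 2, size=n), 0, 2)
r = rng.random(n) < (0.3 + 0.1 * y)  # MNAR: response rate depends on y itself

# Weak-shadow assumption Z ⊥ R | Y: estimate A[z, y] = P(Z=z | Y=y)
# on respondents only; the same matrix applies to non-respondents.
A_hat = np.zeros((3, 5))
for j, yv in enumerate(range(1, 6)):
    mask = r & (y == yv)
    for zv in range(3):
        A_hat[zv, j] = np.mean(z[mask] == zv)
```

Each column of `A_hat` is a conditional distribution over predicted sentiment, so the columns sum to one; these columns become the coefficients of the linear constraints in the bounding step.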
📝 Abstract
Estimating population quantities such as mean outcomes from user feedback is fundamental to platform evaluation and social science, yet feedback is often missing not at random (MNAR): users with stronger opinions are more likely to respond, so standard estimators are biased and the estimand is not identified without additional assumptions. Existing approaches typically rely on strong parametric assumptions or bespoke auxiliary variables that may be unavailable in practice. In this paper, we develop a partial identification framework in which sharp bounds on the estimand are obtained by solving a pair of linear programs whose constraints encode the observed data structure. This formulation naturally incorporates outcome predictions from pretrained models, including large language models (LLMs), as additional linear constraints that tighten the feasible set. We call these predictions weak shadow variables: they satisfy a conditional independence assumption with respect to missingness but need not meet the completeness conditions required by classical shadow-variable methods. When predictions are sufficiently informative, the bounds collapse to a point, recovering standard identification as a special case. In finite samples, to provide valid coverage of the identified set, we propose a set-expansion estimator that achieves a slower-than-$\sqrt{n}$ convergence rate in the set-identified regime and the standard $\sqrt{n}$ rate under point identification. In simulations and semi-synthetic experiments on customer-service dialogues, we find that LLM predictions are often ill-conditioned for classical shadow-variable methods yet remain highly effective in our framework. They shrink identification intervals by 75--83\% while maintaining valid coverage under realistic MNAR mechanisms.
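The pair-of-linear-programs construction can be sketched for a discrete outcome. Everything below is an illustrative assumption, not the paper's implementation: outcomes are ratings 1–5, the LLM prediction Z has three categories, and all probabilities are invented numbers. The unknown is the outcome distribution among non-respondents, `q`; the observed distribution of Z among non-respondents (Z is never missing) plus the conditional-independence matrix `A[z, y] = P(Z=z | Y=y)` supply the linear constraints that tighten the worst-case interval.

```python
import numpy as np
from scipy.optimize import linprog

y_vals = np.arange(1, 6)

# --- quantities estimable from data (here: assumed values for the sketch) ---
p_miss = 0.4                                       # P(R = 0)
p_y_resp = np.array([0.1, 0.15, 0.2, 0.25, 0.3])   # P(Y=y | R=1)
# A[z, y] = P(Z=z | Y=y), estimated on respondents; the weak-shadow
# assumption Z ⊥ R | Y lets us reuse it for non-respondents. Z is
# coarser than Y (3 categories vs 5), so A cannot satisfy completeness.
A = np.array([
    [0.70, 0.50, 0.20, 0.10, 0.05],
    [0.20, 0.30, 0.60, 0.30, 0.15],
    [0.10, 0.20, 0.20, 0.60, 0.80],
])
# P(Z=z | R=0) is observed because the LLM prediction is never missing;
# here it is generated from a hypothetical non-respondent distribution.
q_true = np.array([0.3, 0.25, 0.2, 0.15, 0.1])
p_z_miss = A @ q_true

def mean_bounds(p_y_resp, p_miss, A, p_z_miss, y_vals):
    """Sharp bounds on E[Y] via a pair of LPs over q_y = P(Y=y | R=0)."""
    # Equality constraints: q is a probability vector, and it must
    # reproduce the observed prediction distribution among non-respondents.
    A_eq = np.vstack([np.ones(len(y_vals)), A])
    b_eq = np.concatenate([[1.0], p_z_miss])
    resp_part = (1 - p_miss) * (y_vals @ p_y_resp)  # contribution of R=1
    ends = []
    for sign in (+1, -1):  # minimize, then maximize, the missing part
        res = linprog(sign * p_miss * y_vals, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, 1)] * len(y_vals))
        ends.append(resp_part + sign * res.fun)
    return tuple(sorted(ends))

lo, hi = mean_bounds(p_y_resp, p_miss, A, p_z_miss, y_vals)
```

Dropping the `A`-rows from `A_eq` recovers the worst-case (no-shadow) interval, where `q` can put all mass on the extreme ratings; adding them shrinks the feasible set, and if `A` had full column rank the constraints would pin `q` down uniquely, collapsing the bounds to a point as in the point-identified special case.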