Learning Counterfactual Distributions via Kernel Nearest Neighbors

📅 2024-10-17
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses counterfactual distribution estimation and distributional matrix completion in multi-unit (e.g., users, regions), multi-outcome (e.g., expenditure, engagement) settings where data suffer from nonignorable missingness (MNAR), unobserved confounding, and severe sparsity, often with only a few observations per unit–outcome pair. We propose a novel distributional matrix completion framework: kernel mean embeddings define distributional neighborhoods, distributional similarity is measured via the maximum mean discrepancy (MMD), and a suitable factor model on the embeddings yields consistent recovery of the underlying distributions even under MNAR mechanisms without positivity. The method is also robust to heteroscedastic noise, requiring only two or more observations per observed unit–outcome entry. Experiments demonstrate that the approach significantly outperforms existing single-sample nearest-neighbor and standard matrix completion methods under sparse, biased-sampling, and heteroscedastic regimes.

📝 Abstract
Consider a setting with multiple units (e.g., individuals, cohorts, geographic locations) and outcomes (e.g., treatments, times, items), where the goal is to learn a multivariate distribution for each unit-outcome entry, such as the distribution of a user's weekly spend and engagement under a specific mobile app version. A common challenge is the prevalence of missing not at random data, where observations are available only for certain unit-outcome combinations and the observation availability can be correlated with the properties of the distributions themselves, i.e., there is unobserved confounding. An additional challenge is that for any observed unit-outcome entry, we only have a finite number of samples from the underlying distribution. We tackle these two challenges by casting the problem into a novel distributional matrix completion framework and introduce a kernel-based distributional generalization of nearest neighbors to estimate the underlying distributions. By leveraging maximum mean discrepancies and a suitable factor model on the kernel mean embeddings of the underlying distributions, we establish consistent recovery of the underlying distributions even when data is missing not at random and positivity constraints are violated. Furthermore, we demonstrate that our nearest neighbors approach is robust to heteroscedastic noise, provided we have access to two or more measurements for the observed unit-outcome entries, a robustness not present in prior works on nearest neighbors with single measurements.
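The abstract's core distance measure, the maximum mean discrepancy, is the distance between kernel mean embeddings of two distributions and can be estimated directly from finite samples. The sketch below is an illustrative implementation of the standard biased (V-statistic) MMD² estimator with a Gaussian RBF kernel, not the paper's code; the function names and bandwidth choice are my own.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel matrix between sample arrays x (n, d) and y (m, d)."""
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2, i.e. the squared RKHS distance
    between the kernel mean embeddings of the two empirical distributions."""
    return (rbf_kernel(x, x, bandwidth).mean()
            - 2 * rbf_kernel(x, y, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean())

rng = np.random.default_rng(0)
p1 = rng.normal(0.0, 1.0, (200, 2))
p2 = rng.normal(0.0, 1.0, (200, 2))   # same distribution as p1
q = rng.normal(3.0, 1.0, (200, 2))    # mean-shifted distribution
mmd_same = mmd_squared(p1, p2)        # near zero
mmd_diff = mmd_squared(p1, q)         # clearly positive
```

Because the V-statistic equals the squared norm of the difference of the empirical mean embeddings, it is always nonnegative, which makes it a convenient (if slightly biased) similarity score between unit-outcome entries.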
Problem

Research questions and friction points this paper is trying to address.

Learning multivariate distributions with missing not at random data
Estimating distributions from finite samples using kernel nearest neighbors
Handling unobserved confounding and heteroscedastic noise robustly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kernel nearest neighbors for distribution estimation
Distributional matrix completion with missing data
Consistency guarantees via maximum mean discrepancy and kernel mean embeddings