Learning High-dimensional Gaussians from Censored Data

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses parameter estimation for high-dimensional Gaussian distributions under missing-not-at-random (MNAR) mechanisms, focusing on two strong value-dependent missingness models: self-censoring (where coordinate-wise missingness depends on whether the true value lies in a given set) and linear thresholding (where missingness is determined by a linear function of the true value). For a known missingness model $S(y)$, the paper proposes the first provably efficient algorithm with polynomial sample complexity $\mathrm{poly}(d, 1/\varepsilon)$. The method integrates moment estimation, divide-and-conquer decomposition, probabilistic satisfiability analysis, conditional independence modeling, and total variation (TV) distance control. Theoretically, the algorithm achieves $\varepsilon$-accuracy in TV distance: it recovers the true mean $\mu^*$ and covariance $\Sigma^*$ under self-censoring, and yields robust, scalable mean estimation under linear thresholding. This work constitutes the first solution to high-dimensional Gaussian learning under MNAR with strong coupling between missingness and latent values.

📝 Abstract
We provide efficient algorithms for the problem of distribution learning from high-dimensional Gaussian data where, in each sample, some of the variable values are missing. We suppose that the variables are missing not at random (MNAR). The missingness model, denoted by $S(y)$, is the function that maps any point $y$ in $\mathbb{R}^d$ to the subset of its coordinates that are seen; in this work, we assume that it is known. We study the following two settings: (i) Self-censoring: An observation $x$ is generated by first sampling the true value $y$ from a $d$-dimensional Gaussian $N(\mu^*, \Sigma^*)$ with unknown $\mu^*$ and $\Sigma^*$. For each coordinate $i$, there exists a set $S_i \subseteq \mathbb{R}$ such that $x_i = y_i$ if and only if $y_i \in S_i$; otherwise, $x_i$ is missing and takes a generic value (e.g., "?"). We design an algorithm that learns $N(\mu^*, \Sigma^*)$ up to total variation (TV) distance $\varepsilon$, using $\mathrm{poly}(d, 1/\varepsilon)$ samples, assuming only that each pair of coordinates is observed with sufficiently high probability. (ii) Linear thresholding: An observation $x$ is generated by first sampling $y$ from a $d$-dimensional Gaussian $N(\mu^*, \Sigma)$ with unknown $\mu^*$ and known $\Sigma$, and then applying the missingness model $S$, where $S(y) = \{i \in [d] : v_i^\top y \le b_i\}$ for some $v_1, \dots, v_d \in \mathbb{R}^d$ and $b_1, \dots, b_d \in \mathbb{R}$. We design an efficient mean estimation algorithm, assuming that none of the possible missingness patterns is very rare conditioned on the values of the observed coordinates and that any small subset of coordinates is observed with sufficiently high probability.
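To make the two data-generation processes concrete, here is a minimal simulation sketch. The censoring sets $S_i$ are chosen as intervals purely for illustration (the paper allows more general sets), and `np.nan` stands in for the generic missing value "?"; all parameter choices below are assumptions for the demo, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu = np.zeros(d)           # true mean mu* (demo choice)
Sigma = np.eye(d)          # true covariance Sigma* (demo choice)

def self_censor(y, intervals):
    """Self-censoring: coordinate i is observed iff y_i lies in S_i.

    Here each S_i is a hypothetical interval [lo_i, hi_i];
    censored coordinates are replaced by np.nan ("?").
    """
    x = y.copy()
    for i, (lo, hi) in enumerate(intervals):
        if not (lo <= y[i] <= hi):
            x[i] = np.nan
    return x

def linear_threshold(y, V, b):
    """Linear thresholding: S(y) = {i in [d] : v_i^T y <= b_i}.

    Rows of V are the directions v_i; coordinates with
    v_i^T y > b_i are censored.
    """
    x = y.copy()
    x[V @ y > b] = np.nan
    return x

y = rng.multivariate_normal(mu, Sigma)          # latent sample
x1 = self_censor(y, [(-1.0, 1.0)] * d)          # observation, setting (i)
x2 = linear_threshold(y, np.eye(d), np.zeros(d))  # observation, setting (ii)
```

With `V = I` and `b = 0`, the second model reduces to coordinate-wise censoring of positive values; an estimator would only ever see the censored observations `x1`/`x2`, never the latent `y`.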
Problem

Research questions and friction points this paper is trying to address.

Learning high-dimensional Gaussians from censored MNAR data
Efficient algorithms for self-censoring missing data patterns
Mean estimation under linear thresholding missingness conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient algorithms for high-dimensional Gaussian learning
Handles non-random missing data (MNAR) with known model
Uses poly(d, 1/ε) samples for TV distance ε