🤖 AI Summary
This work addresses the limited attention given to score matching when data are incomplete, in a flexible setting where any subset of coordinates may be missing. We adapt score matching and its major extensions to this setting through two general-purpose variants: Importance-Weighted Score Matching (IW-SM) and Variational Score Matching (VI-SM). Theoretically, we provide finite-sample bounds for the IW approach in finite-domain settings. Empirically, IW-SM performs especially well in small-sample, lower-dimensional cases, while VI-SM is strongest in more complex, high-dimensional settings, as demonstrated on graphical model estimation tasks with both real and simulated data.
📝 Abstract
Score matching is a vital tool for learning the distribution of data, with applications across many areas including diffusion processes, energy-based modelling, and graphical model estimation. Despite these applications, little work explores its use when data is incomplete. We address this by adapting score matching (and its major extensions) to work with missing data in a flexible setting where data can be partially missing over any subset of the coordinates. We provide two separate score matching variations for general use: an importance weighting (IW) approach and a variational approach. We provide finite-sample bounds for our IW approach in finite-domain settings and show that it performs especially well in small-sample, lower-dimensional cases. Complementing this, we show that our variational approach is strongest in more complex, high-dimensional settings, which we demonstrate on graphical model estimation tasks with both real and simulated data.
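For readers unfamiliar with the base technique, the following is a minimal sketch of standard score matching on fully observed data, applied to Gaussian graphical model estimation. It is not the IW or variational method described above and does not handle missing values. It assumes a zero-mean Gaussian model with precision matrix K, for which the empirical score matching objective is 0.5 * mean(||K x||^2) - tr(K). The names `score_matching_loss`, `fit_precision`, and `K_true` are hypothetical and chosen only for illustration.

```python
import numpy as np


def score_matching_loss(K, X):
    """Empirical score matching objective for a zero-mean Gaussian model.

    The model score is grad_x log p(x) = -K x, and the trace of its
    Jacobian is -tr(K), giving 0.5 * mean(||K x||^2) - tr(K).
    """
    scores = -X @ K  # row i holds the model score at sample x_i (K symmetric)
    return 0.5 * np.mean(np.sum(scores ** 2, axis=1)) - np.trace(K)


def fit_precision(X):
    """Closed-form minimiser of the objective above.

    Setting the gradient S K - I to zero gives K = S^{-1}, where S is the
    sample second-moment matrix.
    """
    S = X.T @ X / X.shape[0]
    return np.linalg.inv(S)


# Simulated complete data from a sparse (chain-structured) precision matrix.
rng = np.random.default_rng(0)
K_true = np.array([[2.0, 0.5, 0.0],
                   [0.5, 2.0, 0.5],
                   [0.0, 0.5, 2.0]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(K_true), size=5000)

K_hat = fit_precision(X)
print("estimated precision:\n", np.round(K_hat, 2))
print("loss at estimate:", score_matching_loss(K_hat, X))
print("loss at true K:  ", score_matching_loss(K_true, X))
```

In this complete-data Gaussian case the objective is a convex quadratic in K, so the estimate is available in closed form. With missing coordinates the objective can no longer be evaluated directly on each sample, which is the gap the IW and variational approaches are designed to address.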