AI Summary
This work addresses the lack of theoretically grounded and computationally efficient nonparametric generative frameworks for handling non-monotone missing-at-random (MAR) data. The authors propose FLOWGEM, a novel approach that uniquely integrates Wasserstein gradient flows with local linear density ratio estimation. By iteratively minimizing the expected KL divergence between observed data and generated samples across varying missingness patterns, FLOWGEM enables principled generation of missing values. Built upon a discrete particle evolution scheme coupled with an iterative transport mechanism, the method demonstrates consistently superior performance over existing imputation techniques on both synthetic and real-world datasets, particularly excelling in challenging non-monotone MAR settings.
Abstract
The prevalence of missing values in data science poses a substantial risk to any further analyses. Despite a wealth of research, principled nonparametric methods to deal with general non-monotone missingness are still scarce. Instead, ad-hoc imputation methods are often used, for which it remains unclear whether the correct distribution can be recovered. In this paper, we propose FLOWGEM, a principled iterative method for generating a complete dataset from a dataset with values Missing at Random (MAR). Motivated by convergence results of the ignoring maximum likelihood estimator, our approach minimizes the expected Kullback-Leibler (KL) divergence between the observed data distribution and the distribution of the generated sample over different missingness patterns. To minimize the KL divergence, we employ a discretized particle evolution of the corresponding Wasserstein Gradient Flow, where the velocity field is approximated using a local linear estimator of the density ratio. This construction yields a data generation scheme that iteratively transports an initial particle ensemble toward the target distribution. Simulation studies and real-data benchmarks demonstrate that FLOWGEM achieves state-of-the-art performance across a range of settings, including the challenging case of non-monotone MAR mechanisms. Together, these results position FLOWGEM as a principled and practical alternative to existing imputation methods, and a decisive step towards closing the gap between theoretical rigor and empirical performance.
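The particle evolution described in the abstract can be illustrated with a small toy example. The sketch below is not FLOWGEM itself. It assumes fully observed one-dimensional data, so there are no missingness patterns and no expectation over them. It assumes the objective is KL(q || p), with q the particle distribution and p the target. It also swaps the paper's local linear density ratio estimator for a difference of Gaussian kernel density scores. Under these assumptions the velocity field of the Wasserstein gradient flow is -∇ log(q/p) = ∇ log p - ∇ log q, and each explicit Euler step moves every particle along it. The names `kde_score` and `evolve`, the bandwidth, and the step size are all hypothetical choices made for illustration.

```python
import numpy as np

def kde_score(x, samples, h):
    """Gradient of the log of a Gaussian KDE built on `samples`, evaluated at `x` (1-D)."""
    diff = x[:, None] - samples[None, :]        # pairwise differences, shape (n_x, n_s)
    logw = -0.5 * (diff / h) ** 2               # log kernel weights up to a constant
    logw -= logw.max(axis=1, keepdims=True)     # stabilise before exponentiating
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)           # normalised kernel weights per row
    return -(w * diff).sum(axis=1) / h**2

def evolve(particles, target, step=0.02, n_steps=500, h=0.4):
    """Explicit Euler discretisation of the particle flow (toy stand-in, see assumptions)."""
    x = particles.copy()
    for _ in range(n_steps):
        # velocity = -grad log(q/p) = grad log p - grad log q,
        # with both scores replaced by KDE estimates (an assumption of this sketch)
        v = kde_score(x, target, h) - kde_score(x, x, h)
        x = x + step * v
    return x

rng = np.random.default_rng(0)
target = rng.normal(2.0, 1.0, size=500)   # samples standing in for the observed data
init = rng.normal(-1.0, 0.5, size=300)    # initial particle ensemble
final = evolve(init, target)
print(f"initial mean {init.mean():.2f}, final mean {final.mean():.2f}, target mean {target.mean():.2f}")
```

The attraction term pulls particles toward regions where the target samples are dense, while the repulsion term keeps the ensemble from collapsing onto a single point. The method in the abstract instead estimates the density ratio directly and averages the objective over missingness patterns, neither of which this sketch attempts.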