AI Summary
This work addresses the challenge of effectively integrating a small amount of coupled data with abundant uncoupled marginal observations to enhance downstream statistical inference. The authors propose a fully nonparametric approach that aligns marginal data with the limited coupled samples via optimal transport projections, and they introduce an explicit estimator grounded in the notion of "shadow" couplings to extrapolate the dependence structure and improve estimation accuracy. The method offers geometric interpretability and numerical stability, and it can be approximated in near-linear time and in parallel. Theoretical guarantees are established by combining tools from optimal transport theory, projection-based estimation, and sample complexity analysis. Experiments on both synthetic and real-world datasets are reported to demonstrate the method's accuracy and computational efficiency.
Abstract
In many statistical settings, two types of data are available: coupled data, which preserve the joint structure among variables but are limited in size due to cost or privacy constraints, and marginal data, which are available at larger scales but lack joint structure. Since standard methods require coupled data, marginal information is often discarded. We propose a fully nonparametric procedure that integrates decoupled marginal data with a limited amount of coupled data to improve downstream analysis. The approach can be understood as an extension of coupling via projection in optimal transport. Specifically, the estimator is a solution of an optimal transport projection over the space of probability measures, which provides a natural geometric interpretation. We establish its stability and derive its sample complexity using recent advances in statistical optimal transport. We also present an explicit formula for the estimator based on the "shadow," a notion introduced by Eckstein and Nutz. Furthermore, the estimator can be approximated in almost linear time and in parallel by the entropic shadow, which demonstrates the theoretical and practical strengths of our method. Lastly, we present experiments with real and synthetic data to demonstrate the performance of our method.
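To make the "entropic shadow" idea more concrete, the following is a minimal sketch and not the authors' estimator. It rests on several assumptions: samples are one-dimensional, the cost is squared distance, the shadow is read as a gluing of the empirical coupling of the paired samples with two transport plans (marginal X samples to paired x values, and paired y values to marginal Y samples), and those plans are computed with a plain Sinkhorn loop. The names `sinkhorn_plan`, `sq_cost`, and `entropic_shadow`, the regularization `eps`, and the iteration count are all hypothetical choices made for illustration.

```python
import numpy as np

def sinkhorn_plan(a, b, cost, eps=0.1, n_iter=1000):
    """Entropic OT plan between weight vectors a and b (plain Sinkhorn iterations)."""
    K = np.exp(-cost / eps)           # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)             # match the column marginal
        u = a / (K @ v)               # match the row marginal (done last)
    return u[:, None] * K * v[None, :]

def sq_cost(x, y):
    """Squared-distance cost matrix for 1-D samples (assumed cost)."""
    return (x[:, None] - y[None, :]) ** 2

def entropic_shadow(x_pair, y_pair, x_marg, y_marg, eps=0.1):
    """Glue: marginal X -> paired x -> paired y -> marginal Y (illustrative only)."""
    n, m1, m2 = len(x_pair), len(x_marg), len(y_marg)
    w_n, w_1, w_2 = (np.full(k, 1.0 / k) for k in (n, m1, m2))
    # Plan from the large X sample onto the paired x values, shape (m1, n).
    P1 = sinkhorn_plan(w_1, w_n, sq_cost(x_marg, x_pair), eps)
    # Plan from the paired y values onto the large Y sample, shape (n, m2).
    P2 = sinkhorn_plan(w_n, w_2, sq_cost(y_pair, y_marg), eps)
    # Turn P2 into a transition kernel: each paired y_i spreads unit mass over Y.
    K2 = P2 / P2.sum(axis=1, keepdims=True)
    # Pair i links x_i to y_i, so mass sent to x_i is forwarded through y_i.
    return P1 @ K2

# Synthetic data (assumed): 20 paired points, 60 and 50 unpaired marginal points.
rng = np.random.default_rng(0)
x_pair = rng.uniform(size=20)
y_pair = np.clip(x_pair + 0.1 * rng.normal(size=20), 0.0, 1.0)
x_marg = rng.uniform(size=60)
y_marg = rng.uniform(size=50)

shadow = entropic_shadow(x_pair, y_pair, x_marg, y_marg)
```

In this sketch `shadow[j, k]` is the mass the extrapolated coupling places on the pair `(x_marg[j], y_marg[k])`. Its first marginal is uniform over `x_marg` by construction, while the second marginal matches the uniform weights on `y_marg` only approximately, up to Sinkhorn convergence. The two plans do not depend on each other, so they could be computed in parallel.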