🤖 AI Summary
This work addresses domain adaptation when a designated target variable is systematically missing in the target domain and a Gaussian causal DAG is available from a fully observed source domain. The authors propose an EM framework that jointly leverages source and target data, exploiting the known causal structure to re-estimate only those local mechanisms affected by distributional shift. To reduce the cost of each iteration while preserving convergence guarantees, the traditional generalized least-squares M-step is replaced with a single projected gradient update. Experiments on a synthetic seven-node SEM, the 64-node MAGIC-IRRI genetic network, and the Sachs protein-signaling dataset show that the proposed method improves target-imputation accuracy over both a fit-on-source Bayesian network and a Kiiveri-style EM baseline, with the largest gains under strong domain shift.
📝 Abstract
We study the problem of imputing a designated target variable that is systematically missing in a shifted deployment domain, when a Gaussian causal DAG is available from a fully observed source domain. We propose a unified EM-based framework that combines source and target data through the DAG structure to transfer information from observed variables to the missing target. On the methodological side, we formulate a population EM operator in the DAG parameter space and introduce a first-order (gradient) EM update that replaces the costly generalized least-squares M-step with a single projected gradient step. Under standard local strong-concavity and smoothness assumptions and a BWY-style (Balakrishnan et al., 2017) gradient-stability (bounded missing-information) condition, we show that this first-order EM operator is locally contractive around the true target parameters. This yields geometric convergence and finite-sample guarantees on the parameter error and the induced target-imputation error in Gaussian SEMs under covariate shift and local mechanism shifts. Algorithmically, we exploit the known causal DAG to freeze source-invariant mechanisms and re-estimate only those conditional distributions directly affected by the shift, making the procedure scalable to higher-dimensional models. In experiments on a synthetic seven-node SEM, the 64-node MAGIC-IRRI genetic network, and the Sachs protein-signaling data, the proposed DAG-aware first-order EM algorithm improves target imputation accuracy over a fit-on-source Bayesian network and a Kiiveri-style EM baseline, with the largest gains under pronounced domain shift.
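To make the first-order EM idea concrete, the following is a minimal sketch on a toy problem, not the authors' implementation. It assumes a three-node Gaussian chain X -> Y -> Z in which Y is fully missing in the target domain, the mechanism Y | X is invariant and frozen at its source estimate, and only the coefficient c of Z | Y has shifted. The graph, all numeric values, the step size, the projection radius, and the function names (simulate, posterior_y, first_order_em) are illustrative assumptions; noise variances are kept at their source estimates for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, x_mean, x_sd, c):
    """Toy chain X -> Y -> Z; only x_mean, x_sd and c differ across domains."""
    x = rng.normal(x_mean, x_sd, n)
    y = 1.0 * x + 0.5 * rng.normal(size=n)   # invariant mechanism Y | X
    z = c * y + 0.5 * rng.normal(size=n)     # possibly shifted mechanism Z | Y
    return x, y, z

def posterior_y(x, z, a, c, var_y, var_z):
    """E-step: Gaussian posterior mean and variance of the missing Y given X, Z."""
    v = 1.0 / (1.0 / var_y + c**2 / var_z)
    m = v * (a * x / var_y + c * z / var_z)
    return m, v

def first_order_em(x, z, a, c0, var_y, var_z, step=0.02, n_iter=500, radius=10.0):
    """One projected gradient step on the Q-function per iteration.

    Only c (the shifted mechanism) is updated; a, var_y, var_z stay frozen.
    """
    c = c0
    for _ in range(n_iter):
        m, v = posterior_y(x, z, a, c, var_y, var_z)
        # Gradient of Q(c) = -E[(Z - c Y)^2 | X, Z] / (2 var_z), averaged over samples
        grad = np.mean(z * m - c * (m**2 + v)) / var_z
        # Projection onto the interval [-radius, radius]
        c = float(np.clip(c + step * grad, -radius, radius))
    return c

# Source domain: fully observed, so each mechanism is fit by least squares.
xs, ys, zs = simulate(5000, 0.0, 1.0, c=0.8)
a_hat = xs @ ys / (xs @ xs)
c_src = ys @ zs / (ys @ ys)
var_y = np.mean((ys - a_hat * xs) ** 2)
var_z = np.mean((zs - c_src * ys) ** 2)

# Target domain: covariate shift in X and a shifted Z | Y; Y is used only for scoring.
xt, yt, zt = simulate(5000, 1.0, 1.5, c=2.0)
c_em = first_order_em(xt, zt, a_hat, c_src, var_y, var_z)

def rmse(c):
    """Imputation error of the posterior mean of Y under coefficient c."""
    m, _ = posterior_y(xt, zt, a_hat, c, var_y, var_z)
    return float(np.sqrt(np.mean((m - yt) ** 2)))

print("source c:", round(c_src, 3), "adapted c:", round(c_em, 3))
print("imputation RMSE, source fit:", round(rmse(c_src), 3))
print("imputation RMSE, first-order EM:", round(rmse(c_em), 3))
```

In this toy setting the shifted coefficient is identifiable from the target data only because Y | X is frozen at its source estimate, which mirrors the freezing of source-invariant mechanisms described in the abstract. The constant step size is chosen by hand for these particular scales; a different problem would need its own step size or a line search.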