🤖 AI Summary
This work addresses the challenge of identifying robust predictors when distribution shifts arise from latent confounders and proxy variables violate the completeness assumption. To characterize the set of indistinguishable confounding structures under imperfect proxies, the authors introduce Latent Equivalence Classes (LECs). They establish a weaker cross-domain rank condition on mixture weights that enables point identification of the robust predictor. Furthermore, they propose a novel Proximal Quasi-Bayesian Active Learning (PQAL) framework to efficiently select the minimal number of source domains needed to satisfy the identification condition. Experiments on synthetic data and the semi-synthetic dSprites dataset demonstrate that the proposed method accurately recovers robust predictors and significantly outperforms baseline approaches across various distribution shifts.
📝 Abstract
The domain adaptation problem becomes more challenging when distribution shifts across domains stem from latent confounders that affect both covariates and outcomes. Existing proxy-based approaches to latent shift rely on a strong completeness assumption to uniquely determine (point-identify) a robust predictor. Completeness requires that the proxies carry sufficient information about variations in the latent confounders. With imperfect proxies, the mapping from confounders to the space of proxy distributions is non-injective: multiple latent confounder values can generate the same proxy distribution. This breaks the completeness assumption, and the observed data are consistent with multiple potential predictors (the predictor is only set-identified). To address this, we introduce latent equivalence classes (LECs), defined as groups of latent confounders that induce the same conditional proxy distribution. We show that point identification of the robust predictor remains achievable as long as multiple domains differ sufficiently in how they mix proxy-induced LECs to form the robust predictor. This domain diversity condition is formalized as a cross-domain rank condition on the mixture weights, which is a substantially weaker assumption than completeness. We introduce the Proximal Quasi-Bayesian Active Learning (PQAL) framework, which actively queries a minimal set of diverse domains that satisfy this rank condition. PQAL efficiently recovers the point-identified predictor, demonstrates robustness to varying degrees of shift, and outperforms previous methods on synthetic data and the semi-synthetic dSprites dataset.
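The cross-domain rank condition and the active querying of domains can be illustrated with a toy sketch. Here each domain is summarized by a row of mixture weights over `K` hypothetical LECs; the predictor is treated as point-identified once the stacked weight matrix reaches rank `K`. The Dirichlet-sampled weights, the greedy acquisition rule, and the helper `satisfies_rank_condition` are all illustrative assumptions standing in for the paper's quasi-Bayesian machinery, not the authors' actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3  # number of latent equivalence classes (LECs) -- illustrative choice
# Candidate source domains, each summarized by a mixture-weight row over the
# K LECs. These weights are synthetic; in the paper they would be induced by
# the proxy model, not sampled directly.
candidates = rng.dirichlet(np.ones(K), size=10)

def satisfies_rank_condition(rows, k):
    """Toy version of the cross-domain rank condition: the stacked
    mixture-weight matrix must have rank k for point identification."""
    return np.linalg.matrix_rank(np.vstack(rows)) >= k

# Greedy stand-in for active domain selection: keep a candidate domain only
# if it increases the rank of the mixture-weight matrix, and stop querying
# once the rank condition holds.
selected = [candidates[0]]
for w in candidates[1:]:
    if satisfies_rank_condition(selected, K):
        break
    old_rank = np.linalg.matrix_rank(np.vstack(selected))
    if np.linalg.matrix_rank(np.vstack(selected + [w])) > old_rank:
        selected.append(w)

print(len(selected), satisfies_rank_condition(selected, K))
```

With generic random weights, each accepted domain raises the rank by one, so the loop stops after exactly `K` domains: diversity of domains, not informativeness of any single proxy, is what delivers identification.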