🤖 AI Summary
In cross-institutional causal analysis, a data host often needs to judge how much a prospective merge would improve heterogeneous treatment effect (HTE) estimation before any raw data is exchanged. This paper introduces what the authors describe as the first cryptographically secure, information-theoretic framework for this task: it evaluates the Expected Information Gain (EIG) of a merge and computes it under secure multi-party computation (MPC), so that the value of the merge is quantified without revealing either party's raw data. The EIG accounts for two benefits of merging: improved overlap and reduced epistemic uncertainty. The method can also be combined with differential privacy (DP), meeting privacy requirements while preserving more accurate computation than naive DP alone. The authors evaluate the approach on a range of simulated and realistic benchmarks and report that it is effective and reliable. The code is available anonymously.
📝 Abstract
Merging datasets across institutions is a lengthy and costly procedure, especially when it involves private information. Data hosts may therefore want to gauge prospectively which datasets are most beneficial to merge, without revealing sensitive information. For causal estimation this is particularly challenging, as the value of a merge depends not only on the reduction in epistemic uncertainty but also on the improvement in overlap. To address this challenge, we introduce the first cryptographically secure information-theoretic approach for quantifying the value of a merge in the context of heterogeneous treatment effect estimation. We do this by evaluating the Expected Information Gain (EIG) and utilising multi-party computation to ensure it can be computed securely without revealing any raw data. As we demonstrate, this can be combined with differential privacy (DP) to meet privacy requirements whilst preserving more accurate computation than naive DP alone. To the best of our knowledge, this work presents the first privacy-preserving method for dataset acquisition tailored to causal estimation. We demonstrate the effectiveness and reliability of our method on a range of simulated and realistic benchmarks. The code is available anonymously.
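To make the idea concrete, the following is a minimal toy sketch, not the paper's protocol or model. It assumes a single covariate stratum, Gaussian outcomes with known noise variance, and independent Gaussian priors on the two arm means, so the information gain about the treatment effect has a closed form that depends only on per-arm sample counts. The secret-sharing part is plain additive sharing over a prime field. All names (`share`, `secure_sum`, `effect_eig`, `PRIME`) and all parameter values are hypothetical. For brevity the merged counts are opened in the clear; with only two parties that would reveal the candidate's counts to the host, so a real protocol would evaluate the EIG on the shares and open only the final scalar.

```python
import math
import secrets

PRIME = 2**61 - 1  # toy field modulus (an assumption, not taken from the paper)


def share(value, n_parties=2):
    """Split a non-negative integer into n additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine additive shares into the original value."""
    return sum(shares) % PRIME


def secure_sum(private_values):
    """Sum one private integer per party using additive secret sharing.

    Each party shares its value; party j adds the j-th shares it receives,
    and only these local sums are combined.
    """
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    local_sums = [sum(s[j] for s in all_shares) % PRIME for j in range(n)]
    return reconstruct(local_sums)


def effect_variance(counts, prior_precision=1.0, noise_var=1.0):
    """Posterior variance of (treated mean - control mean) given arm counts."""
    return sum(1.0 / (prior_precision + n / noise_var) for n in counts)


def effect_eig(host_counts, merged_counts, prior_precision=1.0, noise_var=1.0):
    """Entropy reduction (in nats) for the treatment effect after a merge.

    Under the Gaussian, known-variance assumption the posterior variance does
    not depend on the observed outcomes, so the expected gain equals this
    deterministic entropy difference.
    """
    before = effect_variance(host_counts, prior_precision, noise_var)
    after = effect_variance(merged_counts, prior_precision, noise_var)
    return 0.5 * math.log(before / after)


# Host has many treated units but almost no controls in this stratum.
host = (50, 2)               # (treated, control) counts
candidate_controls = (0, 40)  # candidate that fills the overlap gap
candidate_treated = (40, 0)   # candidate that adds more of what the host has

merged_a = tuple(secure_sum([h, c]) for h, c in zip(host, candidate_controls))
merged_b = tuple(secure_sum([h, c]) for h, c in zip(host, candidate_treated))

eig_a = effect_eig(host, merged_a)
eig_b = effect_eig(host, merged_b)
print(f"EIG, control-heavy candidate: {eig_a:.3f} nats")
print(f"EIG, treated-heavy candidate: {eig_b:.3f} nats")
```

In this toy setting both candidates contribute the same number of units, yet the one that supplies controls yields the larger gain, which illustrates why the value of a merge for causal estimation depends on overlap and not only on sample size.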