Is merging worth it? Securely evaluating the information gain for causal dataset acquisition

📅 2024-09-11

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In cross-institutional causal analysis, data hosts need to quantify how much a prospective merge would improve heterogeneous treatment effect (HTE) estimation without exposing raw data. This paper introduces what the authors describe as the first cryptographically secure, information-theoretic framework for this task: it evaluates the expected information gain (EIG) of a merge and computes it under secure multi-party computation (MPC), so that no raw data is revealed. The EIG reflects both benefits of a merge for causal estimation: reduced epistemic uncertainty and improved overlap. Combined with differential privacy (DP), the approach meets privacy requirements while preserving more accurate computation than naive DP alone. The method is evaluated on a range of simulated and realistic benchmarks, and the code is available anonymously.

📝 Abstract

Merging datasets across institutions is a lengthy and costly procedure, especially when it involves private information. Data hosts may therefore want to prospectively gauge which datasets are most beneficial to merge with, without revealing sensitive information. For causal estimation, this is particularly challenging, as the value of a merge depends not only on the reduction in epistemic uncertainty but also on the improvement in overlap. To address this challenge, we introduce the first cryptographically secure information-theoretic approach for quantifying the value of a merge in the context of heterogeneous treatment effect estimation. We do this by evaluating the Expected Information Gain (EIG) and utilising multi-party computation to ensure it can be securely computed without revealing any raw data. As we demonstrate, this can be used with differential privacy (DP) to ensure privacy requirements whilst preserving more accurate computation than naive DP alone. To the best of our knowledge, this work presents the first privacy-preserving method for dataset acquisition tailored to causal estimation. We demonstrate the effectiveness and reliability of our method on a range of simulated and realistic benchmarks. The code is available anonymously.
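To make the idea of scoring a merge by its EIG concrete, here is a minimal sketch in plain NumPy. It is not the paper's model, which the abstract does not specify. It assumes a Bayesian linear-Gaussian outcome model with a hypothetical feature map (`features`), for which the EIG about the weights has a closed form that depends only on covariates and treatment assignments. The helper names, the prior, and the noise variance are all illustrative assumptions, and no cryptography is involved here.

```python
import numpy as np

def expected_information_gain(X_host, X_cand, noise_var=1.0, prior_var=1.0):
    """EIG (in nats) about the regression weights from adding the candidate rows.

    Assumes a linear-Gaussian model, where the EIG equals
    0.5 * (logdet(P_merged) - logdet(P_host)) with P the posterior precision.
    """
    d = X_host.shape[1]
    # Posterior precision after the host's own data (isotropic Gaussian prior).
    P_host = np.eye(d) / prior_var + X_host.T @ X_host / noise_var
    # Precision after also observing outcomes at the candidate's rows.
    P_merged = P_host + X_cand.T @ X_cand / noise_var
    _, logdet_host = np.linalg.slogdet(P_host)
    _, logdet_merged = np.linalg.slogdet(P_merged)
    return 0.5 * (logdet_merged - logdet_host)

def features(x, t):
    """Hypothetical feature map: intercept, covariate, treatment, interaction."""
    return np.column_stack([np.ones_like(x), x, t, x * t])

rng = np.random.default_rng(0)

# Host: units are treated only when x > 0, so overlap is poor.
x_host = rng.uniform(-1, 1, 200)
X_host = features(x_host, (x_host > 0).astype(float))

# Candidate A repeats the host's assignment pattern.
x_a = rng.uniform(-1, 1, 100)
X_a = features(x_a, (x_a > 0).astype(float))

# Candidate B treats the units the host left untreated, improving overlap.
x_b = rng.uniform(-1, 1, 100)
X_b = features(x_b, (x_b < 0).astype(float))

eig_a = expected_information_gain(X_host, X_a)
eig_b = expected_information_gain(X_host, X_b)
print(f"EIG of candidate A: {eig_a:.2f} nats")
print(f"EIG of candidate B: {eig_b:.2f} nats")
```

In this toy setting the candidate that improves overlap scores higher than the one that merely adds more of the same data, which is the kind of ranking a data host would want before committing to a merge.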

Problem

Research questions and friction points this paper is trying to address.

Securely evaluate dataset merge benefits for causal estimation.

Quantify merge value using privacy-preserving cryptographic methods.

Ensure data privacy while improving causal effect estimation accuracy.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cryptographically secure information-theoretic approach

Multi-party computation for secure data evaluation

Differential privacy integration that preserves more accurate computation than naive DP alone
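As a rough illustration of how multi-party computation can aggregate quantities without revealing each party's input, the sketch below uses additive secret sharing, a standard MPC building block. The abstract does not say which protocol the paper uses, so this is an assumed stand-in. The function names and the modulus `PRIME` are hypothetical, inputs are assumed to be non-negative integers smaller than `PRIME`, and everything runs in one process, so it only simulates the message flow and provides no real security.

```python
import secrets

PRIME = 2**61 - 1  # assumed field modulus; must exceed any sum being computed

def share(value, n_parties):
    """Split an integer into n additive shares that sum to value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]

def reconstruct(shares):
    """Recombine shares; only the full set of shares determines the value."""
    return sum(shares) % PRIME

def secure_sum(private_values):
    """Simulate a secure sum: no party sees another party's raw value."""
    n = len(private_values)
    # Each party splits its value and sends one share to every party.
    all_shares = [share(v, n) for v in private_values]
    # Party j adds the shares it received; one partial sum alone looks random.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Publishing the partial sums reveals only the total.
    return reconstruct(partial_sums)

total = secure_sum([12, 30, 7])
print(total)
```

Sums of this kind are the sort of aggregate a merge-value score could be built from, since only the total is ever reconstructed.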
