Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data

📅 2024-09-16
📈 Citations: 3
Influential: 0
🤖 AI Summary
Problem: Existing semiparametric data fusion theory requires each source of individual-level data to align with conditional distributions drawn from a single factorization of the joint target distribution. That requirement fails in common settings such as two-sample instrumental variable analysis, measurement error correction with validation studies, and the integration of epidemiological studies with different designs (e.g., prospective cohorts and retrospective case-control studies). Method: The paper extends the theory to sources aligned with conditional distributions that do not correspond to a single factorization, assuming conditional and marginal distribution alignments and building on semiparametric efficiency theory and influence functions. Contribution: It characterizes the class of all influence functions of regular asymptotically linear estimators and the efficient influence function of any pathwise differentiable parameter, irrespective of the number of data sources, the parameter of interest, or the statistical model for the target distribution. This opens the way to machine-learning debiased, semiparametrically efficient estimation of finite-dimensional parameters.

📝 Abstract
We address the goal of conducting inference about a smooth finite-dimensional parameter by utilizing individual-level data from various independent sources. Recent advancements have led to the development of a comprehensive theory capable of handling scenarios where different data sources align with, possibly distinct subsets of, conditional distributions of a single factorization of the joint target distribution. While this theory proves effective in many significant contexts, it falls short in certain common data fusion problems, such as two-sample instrumental variable analysis, settings that integrate data from epidemiological studies with diverse designs (e.g., prospective cohorts and retrospective case-control studies), and studies with variables prone to measurement error that are supplemented by validation studies. In this paper, we extend the aforementioned comprehensive theory to allow for the fusion of individual-level data from sources aligned with conditional distributions that do not correspond to a single factorization of the target distribution. Assuming conditional and marginal distribution alignments, we provide universal results that characterize the class of all influence functions of regular asymptotically linear estimators and the efficient influence function of any pathwise differentiable parameter, irrespective of the number of data sources, the specific parameter of interest, or the statistical model for the target distribution. This theory paves the way for machine-learning debiased, semiparametric efficient estimation.
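To make the two-sample instrumental variable setting mentioned in the abstract concrete, here is a minimal simulation sketch. It is not the estimator developed in the paper: it uses the standard Wald-ratio (covariance ratio) two-sample IV estimator purely as an illustration. The linear model, its coefficients, the sample sizes, and all function names are assumptions made for this example. One source observes only the instrument and exposure (Z, X), the other only the instrument and outcome (Z, Y), so the sources inform the distributions of X given Z and of Y given Z, which cannot both appear in one factorization of the joint distribution of (Z, X, Y).

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def simulate_source(n, beta, rng, keep):
    """Draw n units from a hypothetical linear IV model, keeping only `keep`."""
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                 # instrument
        u = rng.gauss(0.0, 1.0)                 # unmeasured confounder
        x = 0.8 * z + u + rng.gauss(0.0, 1.0)   # exposure, driven by z and u
        y = beta * x + u + rng.gauss(0.0, 1.0)  # outcome, confounded by u
        full = {"z": z, "x": x, "y": y}
        rows.append({name: full[name] for name in keep})
    return rows


def two_sample_iv(source_zx, source_zy):
    """Wald-ratio estimate: cov(Z, Y) from one source over cov(Z, X) from the other."""
    cov_zx = covariance([r["z"] for r in source_zx], [r["x"] for r in source_zx])
    cov_zy = covariance([r["z"] for r in source_zy], [r["y"] for r in source_zy])
    return cov_zy / cov_zx


rng = random.Random(0)
true_beta = 2.0  # assumed causal effect of x on y in the simulation

# Two independent sources: neither one observes (z, x, y) jointly.
source_zx = simulate_source(20000, true_beta, rng, keep=("z", "x"))
source_zy = simulate_source(20000, true_beta, rng, keep=("z", "y"))

estimate = two_sample_iv(source_zx, source_zy)
print(f"two-sample IV estimate: {estimate:.3f} (true value {true_beta})")
```

In this simulated model the ratio recovers the effect even though no single source contains all three variables, which is the kind of fusion problem the abstract describes. Efficiency and debiasing with machine-learning nuisance estimates, the focus of the paper, are outside the scope of this sketch.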
Problem

Research questions and friction points this paper is trying to address.

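Extending semiparametric data fusion theory to sources aligned with conditional distributions that do not correspond to a single factorization of the target distribution
Covering fusion problems the existing theory cannot handle: two-sample instrumental variable analysis, epidemiological studies with different designs, and measurement error settings supplemented by validation studies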
Enabling efficient estimation across multiple independent data sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

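Extends the existing theory to fuse sources aligned with conditional distributions that do not share a single factorization of the target distribution
Characterizes all influence functions of regular asymptotically linear estimators and the efficient influence function of any pathwise differentiable parameter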
Enables machine-learning debiased semiparametric efficient estimation