🤖 AI Summary
Probabilistic record linkage introduces linkage errors when error-free linking variables are unavailable, which biases downstream statistical inference. To address this, we propose an iterative bootstrap framework for linkage bias correction, the first systematic application of iterative bootstrap to bias mitigation in multi-source data integration. Our method comprises probabilistic linkage, iterative resampling, construction of bias-corrected estimators, and associated statistical testing. Crucially, we introduce a novel bias-variance trade-off diagnostic test that detects when further iterations inflate variance without reducing bias, thereby enhancing estimator robustness. Experiments on a simulated dataset containing hormone information and on a dataset formed by linking two datasets from the Australian Bureau of Statistics' labour mobility surveys demonstrate that our approach significantly reduces linkage-induced bias while effectively constraining variance inflation.
📝 Abstract
Amalgamating data from disparate sources yields an integrated dataset that is a valuable resource for statistical analysis. In probabilistic record linkage, the effectiveness of such integration relies on the availability of linkage variables that are free from errors. Where such variables are lacking, the linked dataset suffers from linkage errors, and the resulting analyses suffer from linkage bias. This paper proposes a methodology that uses the bootstrap technique to construct linkage bias-corrected estimators. It also introduces a test to assess whether increasing the number of bootstrap iterations meaningfully reduces linkage bias or merely inflates variance without further improving accuracy. The methodology is demonstrated on a simulated dataset containing hormone information and on a dataset obtained by linking two datasets from the Australian Bureau of Statistics' labour mobility surveys.
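The idea of a bootstrap bias correction for linkage error can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a deliberately crude linkage-error model in which each record is correctly linked with a known probability `lam` and mislinked records have their outcome values shuffled among themselves, and it applies a single bootstrap correction step to a regression slope. The function names (`mislink`, `slope`, `bias_corrected_slope`) and all parameter values are hypothetical, chosen for illustration only.

```python
import numpy as np

def mislink(y, lam, rng):
    """Assumed linkage-error model: each record is flagged as mislinked with
    probability 1 - lam, and flagged records have their y values shuffled
    among themselves."""
    y_new = y.copy()
    wrong = np.flatnonzero(rng.random(y.size) > lam)
    y_new[wrong] = y[rng.permutation(wrong)]
    return y_new

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def bias_corrected_slope(x, y_linked, lam, n_boot, rng):
    """One bootstrap bias-correction step.

    The linkage-error model is re-applied to the already linked data; the
    average shift it causes in the estimate serves as the bias estimate,
    which is then subtracted from the naive estimate.
    """
    naive = slope(x, y_linked)
    boot = np.array([slope(x, mislink(y_linked, lam, rng))
                     for _ in range(n_boot)])
    bias_hat = boot.mean() - naive
    return naive, naive - bias_hat

# Simulated example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n, beta, lam = 5000, 2.0, 0.8
x = rng.normal(size=n)
y_true = beta * x + rng.normal(size=n)
y_linked = mislink(y_true, lam, rng)   # data as seen after imperfect linkage

naive, corrected = bias_corrected_slope(x, y_linked, lam, n_boot=200, rng=rng)
print(f"true slope {beta}, naive {naive:.3f}, corrected {corrected:.3f}")
```

Under this assumed model the naive slope is attenuated by roughly a factor of `lam`, and one correction step shrinks the relative bias from about `1 - lam` to about `(1 - lam)**2`. Repeating the correction would reduce bias further at the cost of extra variance, which is the trade-off that the test described in the abstract is meant to assess.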