🤖 AI Summary
To address confounding bias arising from covariate imbalance in causal inference from observational data, this paper proposes Two-stage Interpretable Matching (TIM), a two-stage interpretable matching framework. In the first stage, exact matching is performed on all covariates. In the second stage, which applies only to units left without an exact match, the least significant confounder is removed in each iteration and exact matching is retried on the remaining covariates; a learned distance metric over the removed covariates quantifies how close each candidate is to the treatment unit within its stratum. The resulting matches are used to estimate conditional average treatment effects (CATEs) while keeping the matching process transparent. Experiments on synthetic datasets and on real-world CDC healthcare data (estimating the effect of high cholesterol on diabetes) indicate that the approach reduces CATE estimation bias, increases multivariate overlap between treatment and control groups, and scales to high-dimensional data.
📝 Abstract
Matching in causal inference from observational data aims to construct treatment and control groups with similar covariate distributions, thereby reducing confounding and ensuring unbiased estimation of treatment effects. The matched sample closely mimics a randomized controlled trial (RCT), thus improving the quality of causal estimates. We introduce a novel Two-stage Interpretable Matching (TIM) framework for transparent and interpretable covariate matching. In the first stage, we perform exact matching across all available covariates. Treatment and control units without an exact match in the first stage proceed to the second stage. Here, we iteratively refine the matching process by removing the least significant confounder in each iteration and attempting exact matching on the remaining covariates. We learn a distance metric for the dropped covariates to quantify closeness to the treatment unit(s) within the corresponding strata. We use these high-quality matches to estimate the conditional average treatment effects (CATEs). To validate TIM, we conduct experiments on synthetic datasets with varying association structures and correlations. We assess its performance by measuring bias in CATE estimation and evaluating multivariate overlap between treatment and control groups before and after matching. Additionally, we apply TIM to a real-world healthcare dataset from the Centers for Disease Control and Prevention (CDC) to estimate the causal effect of high cholesterol on diabetes. Our results demonstrate that TIM improves CATE estimates, increases multivariate overlap, and scales effectively to high-dimensional data, making it a robust tool for causal inference in observational data.
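
The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the covariate ranking (`ranked`, most important first) is supplied by the caller rather than computed, the learned distance metric is replaced by a fixed-weight absolute-difference distance, covariates are numeric and discrete, and every function and field name here is hypothetical.

```python
from statistics import mean

def exact_stratum(unit, controls, keys):
    """Controls that agree with `unit` on every covariate in `keys`."""
    return [c for c in controls if all(c["x"][k] == unit["x"][k] for k in keys)]

def match_unit(unit, controls, ranked, weights):
    """Stage 1: exact match on all covariates.
    Stage 2: drop the least important covariate until a stratum is non-empty."""
    kept, dropped = list(ranked), []
    stratum = exact_stratum(unit, controls, kept)
    while not stratum and kept:
        dropped.append(kept.pop())  # the last entry is the least important
        stratum = exact_stratum(unit, controls, kept)
    if stratum and dropped:
        # Assumption: fixed weights stand in for the learned distance metric.
        def dist(c):
            return sum(weights[k] * abs(c["x"][k] - unit["x"][k]) for k in dropped)
        best = min(dist(c) for c in stratum)
        stratum = [c for c in stratum if dist(c) == best]
    return stratum, dropped

def estimate_effects(treated, controls, ranked, weights=None):
    """Per-treated-unit effect: own outcome minus mean outcome of its matches."""
    weights = weights or {k: 1.0 for k in ranked}
    results = []
    for t in treated:
        matched, dropped = match_unit(t, controls, ranked, weights)
        effect = t["y"] - mean(c["y"] for c in matched) if matched else None
        results.append({"effect": effect, "dropped": dropped, "n": len(matched)})
    return results

# Toy data: hypothetical covariates, for illustration only.
ranked = ["age", "bmi", "smoker"]
treated = [
    {"x": {"age": 1, "bmi": 2, "smoker": 0}, "y": 5.0},
    {"x": {"age": 2, "bmi": 0, "smoker": 1}, "y": 7.0},
]
controls = [
    {"x": {"age": 1, "bmi": 2, "smoker": 0}, "y": 3.0},
    {"x": {"age": 2, "bmi": 0, "smoker": 0}, "y": 4.0},
    {"x": {"age": 2, "bmi": 1, "smoker": 1}, "y": 6.0},
]
for row in estimate_effects(treated, controls, ranked):
    print(row)
```

In this toy run the first treated unit finds an exact match in stage one, while the second only matches after `smoker` is dropped. Recording the dropped covariates for each match is what keeps the procedure inspectable: a reader can see exactly which covariates a given match agrees on and which were relaxed.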