AI Summary
In observational causal inference, covariate balancing via feature selection often struggles to capture nonlinearities and higher-order interactions. This paper proposes Forest Kernel Balancing, a method that combines the implicit leaf-co-occurrence kernels induced by random forests and Bayesian Additive Regression Trees (BART) with outcome-guided feature learning. It automatically extracts the nonlinear and high-order interaction features that matter for predicting potential outcomes and embeds them directly into the covariate balancing procedure. Balancing thus becomes endogenous to the causal-effect estimation objective, overcoming a key limitation of conventional kernel-based approaches, which ignore outcome information. Extensive simulations and empirical analyses demonstrate that the method substantially improves the accuracy and stability of treatment effect estimation, reducing both bias and variance simultaneously, while its computational efficiency surpasses that of standard kernel balancing methods.
Abstract
While balancing covariates between groups is central to observational causal inference, selecting which features to balance remains a challenging problem. Kernel balancing is a promising approach that first estimates a kernel capturing similarity across units and then balances a (possibly low-dimensional) summary of that kernel, indirectly learning which features are important to balance. In this paper, we propose forest kernel balancing, which leverages the underappreciated fact that tree-based machine learning models, namely random forests and Bayesian additive regression trees (BART), implicitly estimate a kernel based on the co-occurrence of observations in the same terminal leaf node. Because the trees are grown to predict the outcome, the resulting kernel, although solely a function of baseline features, emphasizes the nonlinearities and interactions that are important for predicting the outcome, and therefore for addressing confounding. Through simulations and applied illustrations, we show that forest kernel balancing yields meaningful computational and statistical improvements relative to standard kernel methods, which do not incorporate outcome information when learning features.
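To make the leaf co-occurrence idea concrete, the sketch below builds the proximity kernel the abstract alludes to: fit a forest to predict the outcome, then define K[i, j] as the fraction of trees in which units i and j land in the same terminal leaf. This is an illustrative reconstruction using scikit-learn, not the authors' implementation; the data, forest size, and function name are hypothetical.

```python
# Sketch of a forest proximity kernel (illustrative, not the paper's code).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_kernel(X, y, n_trees=200, random_state=0):
    """K[i, j] = fraction of trees in which units i and j fall in the
    same terminal leaf. Each per-tree co-occurrence matrix is PSD (a
    block matrix of ones over leaf membership), so their average is a
    valid kernel."""
    forest = RandomForestRegressor(
        n_estimators=n_trees, random_state=random_state
    ).fit(X, y)
    leaves = forest.apply(X)  # shape (n_units, n_trees): leaf index per tree
    n = X.shape[0]
    K = np.zeros((n, n))
    for t in range(n_trees):
        col = leaves[:, t]
        K += (col[:, None] == col[None, :])  # 1 where i, j share a leaf
    return K / n_trees

# Toy data with a nonlinearity and an interaction (hypothetical example).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=100)
K = forest_kernel(X, y)
```

Because the trees split on features that predict y, pairs of units that are similar on outcome-relevant nonlinearities and interactions receive high kernel values, which is exactly what a downstream balancing step would then target.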