🤖 AI Summary
In causal effect estimation, the lack of standardized criteria for evaluating hyperparameter tuning impedes reliable model selection and leaves a substantial gap between commonly used metrics and true performance. This paper systematically investigates the interplay between hyperparameter tuning and evaluation, jointly analyzing estimators (T-/X-/R-Learner), base learners (random forests, gradient boosting, neural networks), and evaluation metrics (IPW, DR, PEHE) across four benchmark datasets. Key findings are: (1) with thorough hyperparameter tuning, mainstream causal estimators become roughly equivalent in performance; (2) hyperparameter tuning and the choice of evaluation strategy influence final performance more than either the estimator type or the base learner; and (3) existing evaluation metrics underestimate the performance gain from optimal model selection by over 35% on average. These results indicate that tuning and model evaluation, rather than the choice of estimator or learner, are the primary determinants of causal estimation accuracy, underscoring the need for more robust, theoretically grounded evaluation paradigms in causal machine learning.
📝 Abstract
The performance of most causal effect estimators relies on accurate predictions of high-dimensional non-linear functions of the observed data. The flexibility of modern Machine Learning (ML) methods is well suited to this task. However, data-driven hyperparameter tuning of ML methods requires effective model evaluation to avoid large errors in causal estimates, a task made more challenging because causal inference involves unavailable counterfactuals. Multiple performance-validation metrics have recently been proposed, so practitioners now have to make complex decisions not only about which causal estimators, ML learners and hyperparameters to choose, but also about which evaluation metric to use. Motivated by the lack of clear recommendations, this paper investigates the interplay between these four aspects of model evaluation for causal effect estimation: causal estimators, ML learners, hyperparameters and evaluation metrics. We develop a comprehensive experimental setup that involves many commonly used causal estimators, ML methods and evaluation approaches and apply it to four well-known causal inference benchmark datasets. Our results suggest that optimal hyperparameter tuning of ML learners is enough to reach state-of-the-art performance in effect estimation, regardless of estimators and learners. We conclude that most causal estimators are roughly equivalent in performance if tuned thoroughly enough. We also find that hyperparameter tuning and model evaluation are much more important than the choice of causal estimator or ML method. Finally, given the significant gap we find between the estimation performance achieved with popular evaluation metrics and that achieved with optimal model selection, we call for more research into causal model evaluation to unlock the optimum performance not currently being delivered even by state-of-the-art procedures.
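To make the model-selection problem concrete, the following is a minimal sketch, not the paper's experimental setup. It tunes one hyperparameter of a T-Learner on synthetic data and compares the model picked by an observable IPW-style pseudo-outcome score with the model an oracle would pick using PEHE, which needs the true effect and is therefore only available in simulation. Everything here is an assumption for illustration: the ridge base learner, the data-generating process, the known propensity of 0.5, the root-mean-squared form of PEHE, and all function names.

```python
import numpy as np


def fit_ridge(X, y, lam):
    """Closed-form ridge regression with an unpenalized intercept."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def predict(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w


def t_learner_cate(X_tr, t_tr, y_tr, X_te, lam):
    """T-Learner: fit one outcome model per treatment arm, then subtract."""
    w0 = fit_ridge(X_tr[t_tr == 0], y_tr[t_tr == 0], lam)
    w1 = fit_ridge(X_tr[t_tr == 1], y_tr[t_tr == 1], lam)
    return predict(w1, X_te) - predict(w0, X_te)


def pehe(tau_hat, tau_true):
    """Root-mean-squared error of the estimated effect (oracle metric)."""
    return float(np.sqrt(np.mean((tau_hat - tau_true) ** 2)))


def ipw_risk(tau_hat, y, t, e=0.5):
    """Observable score: squared error against an IPW pseudo-outcome.

    With propensity e, y * (t - e) / (e * (1 - e)) has conditional mean
    equal to the true effect, so no counterfactual is needed.
    """
    pseudo = y * (t - e) / (e * (1.0 - e))
    return float(np.mean((pseudo - tau_hat) ** 2))


def run_selection(seed=0, n=600, lambdas=(0.01, 1.0, 100.0, 10000.0)):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 5))
    t = rng.integers(0, 2, size=n)       # randomized treatment, propensity 0.5
    tau = 1.0 + 2.0 * X[:, 0]            # true effect, known only in simulation
    y = X[:, 1] + t * tau + rng.normal(size=n)

    half = n // 2
    tr, va = slice(0, half), slice(half, n)

    results = []
    for lam in lambdas:
        tau_hat = t_learner_cate(X[tr], t[tr], y[tr], X[va], lam)
        results.append((lam, ipw_risk(tau_hat, y[va], t[va]), pehe(tau_hat, tau[va])))

    picked = min(results, key=lambda r: r[1])   # what a practitioner can select
    oracle = min(results, key=lambda r: r[2])   # best achievable on this grid
    return results, picked, oracle


if __name__ == "__main__":
    results, picked, oracle = run_selection()
    for lam, risk, err in results:
        print(f"lambda={lam:>8}: ipw_risk={risk:.3f}  pehe={err:.3f}")
    print("selected by IPW score:", picked[0], " oracle choice:", oracle[0])
```

The difference between the PEHE of the selected model and the PEHE of the oracle choice is one simple way to express the gap between a practical evaluation metric and optimal model selection. In this toy setting the two may well coincide; the sketch only shows how such a comparison can be set up.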