🤖 AI Summary
This study addresses the disconnect between academic research and industrial practice in treatment effect estimation, where prevailing evaluation paradigms hinder real-world applicability. Through a large-scale empirical analysis, the authors systematically compare meta-learners paired with multiple base learners, as well as specialized causal models, across semi-simulated benchmarks and real-world datasets. The findings reveal a pronounced inconsistency between counterfactual and observable performance metrics, even on the same semi-simulated benchmarks, and show that model rankings derived from semi-simulated data do not transfer to real datasets. Notably, simple meta-learners paired with strong base models are consistently competitive, whereas purpose-built causal models are not. These results challenge the dominant reliance on semi-simulated evaluations and counterfactual metrics, and call for assessment protocols that incorporate observable metrics and real-data validation.
📝 Abstract
Estimating heterogeneous treatment effects with machine learning has attracted substantial attention in both academic research and industrial practice. However, the two communities often evaluate models under markedly different conditions. Methodological work typically relies on semi-simulated benchmarks and metrics that require counterfactual outcomes, whereas real-world applications rely on observable metrics based on ranking or test outcomes. Despite the well-known gap between methodological progress and practical deployment, the relationship between these evaluation regimes has not been examined systematically. We conduct a large-scale empirical study of treatment effect evaluation across standard semi-simulated benchmark families and real-world datasets. Our benchmark covers meta-learners paired with multiple base learners, as well as specialized causal machine learning models. We evaluate these methods using observable metrics common in application-oriented literature, alongside counterfactual metrics commonly used in methods papers. Our results reveal two complementary gaps. First, counterfactual metrics do not reliably recover the estimators preferred by observable metrics, even on the same semi-simulated benchmarks. Second, rankings obtained on semi-simulated benchmarks do not transfer to real datasets. We further find that simple meta-learners with strong base models are consistently competitive, in contrast to specialized causal models. Overall, our findings suggest that progress in treatment effect estimation research should not be assessed solely through counterfactual metrics and semi-simulated benchmarks, and that it would benefit from incorporating observable metrics and real-data validation.
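To make the distinction between the two metric families concrete, the following is a minimal sketch, not the paper's evaluation code. The abstract does not name specific metrics, so the choices here are assumptions: PEHE (root mean squared error on individual effects) stands in for a counterfactual metric, and a simple uplift-at-k score stands in for an observable, ranking-based metric. The function names `pehe` and `uplift_at_k` are hypothetical, and the uplift score assumes treatment was randomly assigned.

```python
import numpy as np


def pehe(tau_true, tau_pred):
    """Counterfactual metric (assumed example): RMSE between true and
    predicted individual effects. Needs tau_true, which is only available
    when outcomes are simulated."""
    tau_true = np.asarray(tau_true, dtype=float)
    tau_pred = np.asarray(tau_pred, dtype=float)
    return float(np.sqrt(np.mean((tau_true - tau_pred) ** 2)))


def uplift_at_k(y, t, tau_pred, k=0.3):
    """Observable metric (assumed example): among the top-k fraction of
    units ranked by predicted effect, the difference in mean observed
    outcome between treated and control units. Uses only observed y and t;
    assumes randomized treatment."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=int)
    tau_pred = np.asarray(tau_pred, dtype=float)

    n_top = max(1, int(np.ceil(k * len(y))))
    # Stable sort so ties keep their original order.
    top = np.argsort(-tau_pred, kind="stable")[:n_top]
    y_top, t_top = y[top], t[top]

    if t_top.sum() == 0 or (1 - t_top).sum() == 0:
        raise ValueError("top-k group needs both treated and control units")
    return float(y_top[t_top == 1].mean() - y_top[t_top == 0].mean())
```

One simple way the two kinds of metric can diverge: adding a constant to every predicted effect worsens PEHE but leaves the ranking, and therefore `uplift_at_k`, unchanged. This is only an illustration of why agreement between the regimes is not automatic, not a claim about the mechanism behind the paper's results.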