Causal K-Means Clustering

📅 2024-05-05
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This paper addresses the challenge of identifying heterogeneous causal effects when the subgroup structure is unknown. It proposes Causal k-Means Clustering, which applies k-means to unknown counterfactual functions in order to discover subgroups with similar causal responses. The authors present a simple plug-in estimator and study its rate of convergence, then develop a bias-corrected estimator, based on nonparametric efficiency theory and double machine learning, that attains root-n rates and asymptotic normality in large nonparametric models. The framework accommodates multiple treatment levels and extends to generic pseudo-outcomes such as partially observed outcomes. Finite-sample properties are explored via simulation, and the methods are illustrated in a study of treatment programs for adolescent substance abuse.

📝 Abstract
Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: Causal k-Means Clustering, which harnesses the widely-used k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite sample properties via simulation, and illustrate the proposed methods in a study of treatment programs for adolescent substance abuse.
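The plug-in idea described in the abstract can be sketched roughly as follows: fit an off-the-shelf regression for each treatment level, predict every unit's outcome under each level, and run ordinary k-means on the resulting vectors. This is a minimal illustration, not the authors' code. It assumes no unmeasured confounding, uses scikit-learn's `GradientBoostingRegressor` and `KMeans` as stand-in learners, skips any sample splitting, and relies on synthetic data. The function name `plug_in_causal_kmeans` and all variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor


def plug_in_causal_kmeans(X, A, Y, n_clusters, random_state=0):
    """Hypothetical helper: cluster units on estimated counterfactual means."""
    levels = np.unique(A)
    mu_hat = np.empty((X.shape[0], levels.size))
    for j, a in enumerate(levels):
        arm = A == a
        # Outcome regression fitted only on units that received level a
        # (valid as a counterfactual estimate only under no unmeasured confounding).
        model = GradientBoostingRegressor(random_state=random_state)
        model.fit(X[arm], Y[arm])
        # Predict the outcome under level a for every unit.
        mu_hat[:, j] = model.predict(X)
    # Standard k-means on the estimated counterfactual mean vectors.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(mu_hat)
    return labels, km.cluster_centers_, mu_hat


# Synthetic data (an assumption for illustration): two latent subgroups
# that respond in opposite directions to three treatment levels.
rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 2))
A = rng.integers(0, 3, size=n)
group = (X[:, 0] > 0).astype(int)
effect = np.array([[0.0, 1.0, 2.0], [0.0, -1.0, -2.0]])
Y = effect[group, A] + 0.1 * rng.normal(size=n)

labels, centers, mu_hat = plug_in_causal_kmeans(X, A, Y, n_clusters=2)
```

Each row of `centers` summarizes one discovered subgroup by its estimated mean outcome under each treatment level. Because the clustering inherits the error of the fitted regressions, this sketch should not be expected to show the root-n behavior the abstract attributes to the bias-corrected estimator.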
Problem

Research questions and friction points this paper is trying to address.

Identify heterogeneous treatment effects in subgroups.
Cluster unknown counterfactual functions using k-means.
Develop bias-corrected estimators for nonparametric models.
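One way to write down the clustering target implied by the points above is sketched here. The notation is an assumption introduced for illustration and is not taken from this page: $Y^a$ is the potential outcome under treatment level $a \in \{1, \dots, p\}$, $X$ is the covariate vector, and $C = \{c_1, \dots, c_k\}$ is a set of $k$ cluster centers in $\mathbb{R}^p$.

```latex
\mu_a(X) = \mathbb{E}\left[ Y^a \mid X \right], \qquad
\mu(X) = \bigl( \mu_1(X), \dots, \mu_p(X) \bigr)^{\top},
\qquad
R(C) = \mathbb{E}\left[ \min_{1 \le j \le k} \bigl\| \mu(X) - c_j \bigr\|_2^2 \right].
```

Under this reading, the difficulty is that $\mu(X)$ is never observed directly, so the usual k-means risk has to be minimized with $\mu$ replaced by an estimate.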
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal k-Means Clustering
Plug-in estimator
Bias-corrected estimator
Kwangho Kim
Department of Statistics, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Korea
Jisu Kim
Department of Statistics, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea
Edward H. Kennedy
Associate Professor of Statistics & Data Science, Carnegie Mellon University
causal inference, nonparametrics, machine learning, health & public policy