🤖 AI Summary
This study addresses inference on the distribution of individual treatment effects in randomized experiments, such as the proportion of units that benefit, the median effect, or the largest effect, without the power loss caused by a poorly chosen pre-specified test statistic. The authors propose an adaptive randomization test that combines multiple rank-based statistics and remains valid in finite samples, with no prior knowledge of the optimal statistic required. The combination avoids the power loss incurred by searching over statistics and applying a Bonferroni correction, and accompanying weighting schemes aggregate evidence across strata of heterogeneous sizes in stratified experiments. Applied to a randomized evaluation of a teacher training program, the combined test suggests that roughly half of treated teachers benefited, whereas a single rank-based test may indicate only a small minority.
📝 Abstract
What proportion of treated units actually benefited from an experimental intervention? What is the median or the largest individual treatment effect? This paper develops methods for answering such questions about the distribution of individual causal effects in randomized experiments. Existing approaches require the analyst to select a rank-based test statistic before observing the data. A poor choice can substantially reduce power, while searching over multiple test statistics and adjusting for multiplicity using Bonferroni correction also incurs power loss. We propose inference procedures that adaptively combine multiple rank-based statistics while maintaining finite-sample validity. For stratified experiments, we further develop weighting schemes that effectively aggregate evidence across strata of heterogeneous sizes. The resulting combined test achieves power comparable to, or exceeding, that of the best individual test, without requiring prior knowledge of the optimal statistic. When applied to a randomized experiment evaluating a teacher training program, the combined test suggests that roughly half of treated teachers benefited, whereas a single rank-based test may indicate only a small minority. Thus, the choice of test determines whether the program appears broadly successful or narrowly effective.
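The general idea of combining several rank-based statistics without a Bonferroni penalty can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes a completely randomized (unstratified) design, tests only the sharp null of no effect on any unit rather than quantiles of individual effects, and uses Stephenson-type rank scores and a minimum-p-value combination as illustrative choices. The function names `rank_scores` and `combined_rank_test` and the parameter `s_values` are hypothetical.

```python
import numpy as np
from math import comb


def rank_scores(n, s):
    """Stephenson-type scores (assumed choice): rank r in 1..n gets C(r-1, s-1)."""
    return np.array([comb(r - 1, s - 1) for r in range(1, n + 1)], dtype=float)


def combined_rank_test(y, z, s_values=(2, 6, 10), n_perm=2000, seed=0):
    """Min-p combination of several rank-sum statistics, calibrated by
    re-randomization. Tests the sharp null of no effect on any unit,
    assuming a completely randomized design."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    n = y.size
    ranks = np.argsort(np.argsort(y))  # 0-based ranks; ties broken arbitrarily

    # One row of unit-level scores per candidate statistic.
    scores = np.stack([rank_scores(n, s)[ranks] for s in s_values])

    # Row 0 is the observed assignment; the rest are re-randomized draws.
    draws = np.stack([z] + [rng.permutation(z) for _ in range(n_perm)])
    stats = draws @ scores.T  # shape (n_perm + 1, number of statistics)

    # p-value of every draw under every statistic, relative to all draws.
    n_draws = stats.shape[0]
    pvals = np.empty_like(stats)
    for j in range(stats.shape[1]):
        col = np.sort(stats[:, j])
        n_below = np.searchsorted(col, stats[:, j], side="left")
        pvals[:, j] = (n_draws - n_below) / n_draws

    # Combine by the smallest p-value, then calibrate that minimum against
    # its own re-randomization distribution instead of using Bonferroni.
    min_p = pvals.min(axis=1)
    p_combined = float(np.mean(min_p <= min_p[0]))
    return p_combined, pvals[0]


# Usage on simulated data (illustrative only, not from the paper).
rng = np.random.default_rng(1)
z = np.repeat([0, 1], 20)
y = rng.normal(size=40) + 1.5 * z
p_combined, p_each = combined_rank_test(y, z)
print(p_combined, p_each)
```

Because the minimum p-value is recomputed for every re-randomized assignment, the combined p-value accounts for the search over statistics directly, which is the sense in which no separate multiplicity correction is applied in this sketch.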