🤖 AI Summary
This paper addresses the problem of selecting an appropriate covariate balance metric in randomized experiments. Methodologically, it establishes a unified theoretical framework for quadratic-form rerandomization, characterizing the asymptotic statistical properties of rerandomization under any positive semi-definite matrix $A$ and elucidating how different choices of $A$ induce distinct balance criteria. Theoretically, it proves that the Euclidean distance (corresponding to $A = I$) is minimax optimal for variance reduction, so the precision of the mean-difference estimator is never far from that of the best choice of $A$, and shows that it outperforms alternatives such as the Mahalanobis distance in high-dimensional settings. Monte Carlo simulations confirm its robust performance in estimating average treatment effects. The primary contribution is the first general theoretical framework for analyzing quadratic-form balance metrics, which justifies the Euclidean distance as a default, statistically principled choice. This yields a simple, efficient, and broadly applicable covariate balance criterion for experimental design.
📝 Abstract
In the design stage of a randomized experiment, one way to ensure treatment and control groups exhibit similar covariate distributions is to randomize treatment until some prespecified level of covariate balance is satisfied. This experimental design strategy is known as rerandomization. Most rerandomization methods utilize balance metrics based on a quadratic form $v^T A v$, where $v$ is a vector of covariate mean differences and $A$ is a positive semi-definite matrix. In this work, we derive general results for treatment-versus-control rerandomization schemes that employ quadratic forms for covariate balance. In addition to allowing researchers to quickly derive properties of rerandomization schemes not previously considered, our theoretical results provide guidance on how to choose the matrix $A$ in practice. We find that the Mahalanobis and Euclidean distances optimize different measures of covariate balance. Furthermore, we establish how the covariates' eigenstructure and their relationship to the outcomes dictate which matrix $A$ yields the most precise mean-difference estimator for the average treatment effect. We find that the Euclidean distance is minimax optimal, in the sense that the mean-difference estimator's precision is never too far from the optimal choice, regardless of the relationship between covariates and outcomes. Our theoretical results are verified via simulation, where we find that rerandomization using the Euclidean distance has better performance in high-dimensional settings and typically achieves greater variance reduction for the mean-difference estimator than other quadratic forms.
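To make the acceptance-rejection idea concrete, the following is a minimal illustrative sketch (not the paper's implementation): treatment assignments are redrawn until the quadratic-form balance statistic $v^T A v$ falls below a threshold, with $A = I$ giving the Euclidean criterion and $A$ set to an inverse covariance of $v$ giving a Mahalanobis-style criterion. The function name, threshold values, and covariance approximation below are illustrative assumptions, not quantities specified in the paper.

```python
# Illustrative sketch (not from the paper): acceptance-rejection rerandomization
# using a generic quadratic-form balance metric v^T A v, where v is the vector
# of treatment-minus-control covariate mean differences.
import numpy as np

def quadratic_form_rerandomize(X, n_treat, A=None, threshold=0.1, max_draws=10_000, rng=None):
    """Redraw a complete randomization until the balance statistic v^T A v
    falls below `threshold`. With A = I this is the Euclidean criterion;
    with A = inv(cov(v)) it corresponds to the Mahalanobis criterion.
    `threshold` and `max_draws` are illustrative tuning choices."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    if A is None:                       # default to the Euclidean distance, A = I
        A = np.eye(p)
    for _ in range(max_draws):
        treat = np.zeros(n, dtype=bool)
        treat[rng.choice(n, size=n_treat, replace=False)] = True
        v = X[treat].mean(axis=0) - X[~treat].mean(axis=0)   # covariate mean differences
        if v @ A @ v <= threshold:      # accept only sufficiently balanced assignments
            return treat
    raise RuntimeError("no assignment met the balance threshold; relax `threshold`")

# Example: compare the Euclidean (A = I) and Mahalanobis-style criteria on synthetic covariates.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
cov_v = np.cov(X, rowvar=False) * (1 / 50 + 1 / 50)   # rough covariance of v under an equal split
treat_euclid = quadratic_form_rerandomize(X, n_treat=50, A=np.eye(5), threshold=0.05, rng=1)
treat_mahal = quadratic_form_rerandomize(X, n_treat=50, A=np.linalg.inv(cov_v), threshold=1.0, rng=1)
```

The only design choice the sketch exposes is the matrix $A$: swapping $A = I$ for an inverse-covariance matrix changes which directions of imbalance the criterion penalizes, which is exactly the trade-off the paper analyzes.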