🤖 AI Summary
In small-sample randomized clinical trials, there is no established method for covariate-adjusted inference on risk differences that is simultaneously robust, efficient, and controls the Type I error. This simulation study compares unconditional exact tests, the Mantel–Haenszel method, and several g-computation approaches (standard, robust, and penalized variants). The results indicate that much of the observed Type I error inflation stems from a mismatch between the estimand and the variance estimation, rather than from limited sample size alone. Standard g-computation with Wald-type inference often shows inflated Type I error in very small samples, whereas robust or penalized variants improve error control at the cost of reduced power; classical methods such as the Mantel–Haenszel and Suissa–Shuster tests remain robust but may forgo the efficiency gains of covariate adjustment. Based on these findings, the work offers practical recommendations for method selection that align the estimand, variance estimation, and inferential target.
📝 Abstract
Binary endpoints are common in clinical trials, and conditional odds ratios have traditionally been used to assess treatment effects. However, odds ratios are difficult to interpret, are non-collapsible, and rely on strong assumptions to serve as a relevant overall summary measure for the trial. As an alternative, risk differences have gained increasing prominence as a more interpretable, clinically meaningful, and assumption-lean measure of treatment effects. This shift has also been motivated by new regulatory guidance, which emphasizes the relevance of marginal estimands and encourages covariate adjustment. Yet covariate-adjusted inference for risk differences, particularly in smaller samples, has methodological subtleties and lacks well-established best practices. We conduct a simulation study comparing methods for estimating and testing risk differences in small-sample ($N \leq 150$) randomized clinical trials with prognostic categorical baseline covariates, focusing on exact unconditional tests, Mantel–Haenszel methods, and $g$-computation (standardization) approaches. We find that several $g$-computation approaches exhibit inflated Type I error in very small samples when standard Wald-type inference is applied, whereas robust or penalized variants improve error control at the expense of power. Classical methods such as the Mantel–Haenszel and Suissa–Shuster tests remain robust but may forgo efficiency gains from covariate adjustment. Overall, our results indicate that much of the observed Type I error inflation reflects misalignment between estimand and variance estimation rather than small sample size alone. Based on these results, we provide practical recommendations to guide method selection that align the estimand, variance estimation, and inferential target.
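For readers unfamiliar with $g$-computation (standardization), the following is a minimal illustrative sketch of the point estimate of a marginal risk difference adjusted for one categorical baseline covariate. It is not the authors' implementation. It assumes a saturated working model, so the predicted risk in each arm-by-stratum cell is simply the observed event proportion, and it computes no variance or test, which is the part the abstract identifies as delicate in small samples. The function name, the record layout, and the toy data are all hypothetical.

```python
from collections import defaultdict


def g_computation_risk_difference(records):
    """Point estimate of the marginal risk difference via standardization.

    records: iterable of (stratum, arm, outcome), with arm and outcome in {0, 1}.
    Assumption (illustrative only): a saturated working model, so the predicted
    risk for a given arm and stratum is the observed proportion in that cell.
    """
    events = defaultdict(int)        # (stratum, arm) -> number of events
    counts = defaultdict(int)        # (stratum, arm) -> number of subjects
    stratum_size = defaultdict(int)  # stratum -> subjects pooled over both arms

    for stratum, arm, outcome in records:
        events[(stratum, arm)] += outcome
        counts[(stratum, arm)] += 1
        stratum_size[stratum] += 1

    n_total = sum(stratum_size.values())
    marginal_risk = {}
    for arm in (0, 1):
        weighted_sum = 0.0
        for stratum, n_s in stratum_size.items():
            if counts[(stratum, arm)] == 0:
                # Empty cells are a typical small-sample problem; this sketch
                # refuses to extrapolate instead of guessing a prediction.
                raise ValueError(f"no subjects in stratum {stratum!r}, arm {arm}")
            # Predict every subject in the stratum as if assigned to `arm`.
            weighted_sum += n_s * events[(stratum, arm)] / counts[(stratum, arm)]
        marginal_risk[arm] = weighted_sum / n_total

    return marginal_risk[1] - marginal_risk[0], marginal_risk


# Hypothetical toy trial: stratum "A" has 3/4 events on treatment and 1/4 on
# control; stratum "B" has 1/2 events on treatment and 0/2 on control.
example_records = (
    [("A", 1, 1)] * 3 + [("A", 1, 0)] * 1
    + [("A", 0, 1)] * 1 + [("A", 0, 0)] * 3
    + [("B", 1, 1)] * 1 + [("B", 1, 0)] * 1
    + [("B", 0, 0)] * 2
)
risk_difference, risks = g_computation_risk_difference(example_records)
print(risk_difference, risks)
```

In this toy example the standardized risks are about 0.667 on treatment and 0.167 on control, giving a risk difference of about 0.5. A main-effects logistic working model, robust or penalized fitting, and the choice of variance estimator would all replace or extend the cell-proportion step and are outside the scope of this sketch.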