🤖 AI Summary
Traditional t- and F-tests for linear models assume a fixed sample size, so applying them repeatedly in a sequential A/B test with continuous monitoring and early stopping inflates the Type-I error. This paper develops an anytime-valid causal inference framework for linear regression adjustment, providing closed-form sequential F-tests and confidence sequences for both the parametric linear model and a nonparametric, model-free setting. In randomized designs, the nonparametric procedures drop the linear-model assumptions while retaining time-uniform Type-I error control and confidence coverage at every sample size. All test statistics and confidence sequences are computable from the output of a standard regression table, so significance and intervals can be updated as data arrive. The method is illustrated on software A/B experiments at Netflix, with regression adjustment on pre-treatment outcomes, and it safeguards against malpractices such as p-hacking.
📝 Abstract
Linear regression adjustment is commonly used to analyse randomised controlled experiments due to its efficiency and robustness against model misspecification. Current testing and interval estimation procedures leverage the asymptotic distribution of such estimators to provide Type-I error and coverage guarantees that hold only at a single sample size. Here, we develop the theory for the anytime-valid analogues of such procedures, enabling linear regression adjustment in the sequential analysis of randomised experiments. We first provide sequential $F$-tests and confidence sequences for the parametric linear model, which provide time-uniform Type-I error and coverage guarantees that hold for all sample sizes. We then relax all linear model parametric assumptions in randomised designs and provide nonparametric model-free sequential tests and confidence sequences for treatment effects. This formally allows experiments to be continuously monitored for significance and stopped early, and it safeguards against statistical malpractices in data collection. A particular feature of our results is their simplicity. Our test statistics and confidence sequences all admit closed-form expressions, which are functions of statistics directly available from a standard linear regression table. We illustrate our methodology with the sequential analysis of software A/B experiments at Netflix, performing regression adjustment with pre-treatment outcomes.
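
The gap between guarantees at a single sample size and time-uniform guarantees can be seen in a small simulation. The sketch below is not the paper's sequential $F$-test. It assumes a much simpler setting (i.i.d. standard normal observations with known unit variance and a true mean of zero) and compares a fixed-n z-test checked after every observation with a classical normal-mixture boundary of the Robbins type. All names and parameters (`simulate_null_rejection_rates`, `RHO`, `first_look`, and so on) are hypothetical choices made for this illustration.

```python
import numpy as np

ALPHA = 0.05                 # nominal Type-I error level
Z_CRIT = 1.959963984540054   # two-sided normal critical value for ALPHA = 0.05
RHO = 1.0                    # mixture precision; an arbitrary tuning choice


def simulate_null_rejection_rates(n_sims=2000, n_max=500, first_look=10, seed=0):
    """Simulate null experiments and return three rejection rates:
    fixed-n test at n_max only, fixed-n test at every look, mixture boundary."""
    rng = np.random.default_rng(seed)
    # Null data: true mean is zero, variance is known and equal to one.
    x = rng.standard_normal((n_sims, n_max))
    n = np.arange(1, n_max + 1)
    s = np.cumsum(x, axis=1)              # running sums S_n
    looks = slice(first_look - 1, None)   # monitor from first_look onwards

    # Fixed-n z-test: reject when |S_n| / sqrt(n) exceeds the critical value.
    fixed_reject = np.abs(s) / np.sqrt(n) > Z_CRIT
    rate_single_look = fixed_reject[:, -1].mean()
    rate_peeking = fixed_reject[:, looks].any(axis=1).mean()

    # Normal-mixture boundary: by Ville's inequality, the probability that
    # |S_n| ever reaches sqrt((n + rho) * log((n + rho) / (rho * alpha^2)))
    # is at most alpha under the null.
    boundary = np.sqrt((n + RHO) * np.log((n + RHO) / (RHO * ALPHA**2)))
    rate_mixture = (np.abs(s) >= boundary)[:, looks].any(axis=1).mean()

    return rate_single_look, rate_peeking, rate_mixture


fixed_rate, peek_rate, mix_rate = simulate_null_rejection_rates()
print(f"fixed-n test, single look at n_max: {fixed_rate:.3f}")
print(f"fixed-n test, checked at every n:   {peek_rate:.3f}")
print(f"mixture boundary, checked at every n: {mix_rate:.3f}")
```

Under these assumptions, the fixed-n test stays near its nominal level when evaluated once, rejects far more often when evaluated at every look, and the mixture boundary keeps the rejection rate at or below the nominal level despite continuous monitoring. The paper's procedures extend this kind of guarantee to regression-adjusted estimators, where the variance is unknown.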