🤖 AI Summary
In A/B testing, the t-test relies on the Central Limit Theorem for its normal approximation, but with small samples or skewed outcome data it can suffer from inflated Type-I error rates and miscalibrated confidence intervals. To address this, the authors propose an empirical validation method: repeatedly resample A/A tests to generate a null p-value distribution, then apply the Kolmogorov–Smirnov test to assess its uniformity, which directly checks whether the t-test's normality assumption holds. The check is empirical and operates on the observed outcome data, so it does not require specifying the outcome distribution in advance. The authors demonstrate the method and show how it helps identify scenarios prone to inflated Type-I errors. The work provides a practical diagnostic for improving the statistical reliability of A/B testing.
📝 Abstract
A/B-tests are a cornerstone of experimental design on the web, with wide-ranging applications and use-cases. The statistical $t$-test comparing differences in means is the most commonly used method for assessing treatment effects, often justified through the Central Limit Theorem (CLT). The CLT guarantees that, as the sample size grows, the sampling distribution of the estimated Average Treatment Effect converges to normality, making the $t$-test valid for sufficiently large sample sizes. When outcome measures are skewed or non-normal, quantifying what "sufficiently large" entails is not straightforward. To ensure that confidence intervals maintain proper coverage and that $p$-values accurately reflect the false positive rate, it is critical to validate this normality assumption. We propose a practical method to test this, by analysing repeatedly resampled A/A-tests. When the normality assumption holds, the resulting $p$-value distribution should be uniform, and this property can be tested using the Kolmogorov-Smirnov test. This provides an efficient and effective way to empirically assess whether the $t$-test's assumptions are met, and hence whether the A/B-test is valid. We demonstrate our methodology and highlight how it helps to identify scenarios prone to inflated Type-I errors. Our approach provides a practical framework to ensure and improve the reliability and robustness of A/B-testing practices.
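
The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the resampling scheme, so the sketch assumes each A/A-test is a random split of the observed outcomes into two equal halves, uses Welch's $t$-test, and the names `aa_pvalues` and `check_ttest_validity` are hypothetical.

```python
import numpy as np
from scipy import stats


def aa_pvalues(outcomes, n_resamples=1000, seed=0):
    """Return t-test p-values from repeated A/A splits of `outcomes`.

    Assumption: each A/A-test is a random split of the data into two
    equal halves; no treatment is applied, so the null hypothesis holds.
    """
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    half = outcomes.size // 2
    pvals = np.empty(n_resamples)
    for i in range(n_resamples):
        shuffled = rng.permutation(outcomes)
        group_a, group_b = shuffled[:half], shuffled[half:]
        # Welch's t-test on the difference in means between the halves.
        pvals[i] = stats.ttest_ind(group_a, group_b, equal_var=False).pvalue
    return pvals


def check_ttest_validity(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Test the A/A p-values for uniformity with a Kolmogorov-Smirnov test."""
    pvals = aa_pvalues(outcomes, n_resamples=n_resamples, seed=seed)
    # One-sample KS test against the Uniform(0, 1) distribution.
    ks = stats.kstest(pvals, "uniform")
    return {
        "ks_statistic": float(ks.statistic),
        "ks_pvalue": float(ks.pvalue),
        # Share of A/A-tests falsely declared significant at level alpha.
        "empirical_fpr": float(np.mean(pvals < alpha)),
        # True means the p-values deviate detectably from uniformity.
        "uniform_rejected": bool(ks.pvalue < alpha),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Hypothetical skewed outcome, e.g. revenue per user.
    skewed = rng.lognormal(mean=0.0, sigma=3.0, size=200)
    print(check_ttest_validity(skewed))
```

A rejection of uniformity suggests that the normal approximation is not yet reliable at the given sample size, so the $t$-test's $p$-values and confidence intervals should be treated with caution. Because the splits reuse the same observations, the resampled $p$-values are not fully independent, which is a simplification of this sketch.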