🤖 AI Summary
This study addresses the “winner’s curse” in A/B testing, where selection effects induce upward bias in treatment effect estimates and invalidate confidence intervals—particularly under low statistical power—thereby compromising decision-making. To mitigate this, the authors propose the Bayesian Hybrid Shrinkage (BHS) method, which operates within an empirical Bayes framework and introduces experiment-specific local shrinkage factors. By constructing data-driven priors that incorporate individual experiment characteristics, BHS corrects for selection bias while remaining robust to prior misspecification, overcoming a limitation of traditional approaches that apply uniform shrinkage. The method admits closed-form inference, making it suitable for high-throughput production environments. Empirical evaluations on both simulated and real-world Meta data demonstrate that BHS substantially reduces estimation bias and achieves more accurate interval coverage, even under severe model misspecification.
📝 Abstract
The widespread adoption of randomized controlled trials (A/B Tests) for decision-making has introduced a pervasive "Winner's Curse": experiments selected for launch often exhibit upwardly biased effect estimates and invalid confidence intervals. This selection bias leads to over-optimistic impact projections and undermines decision-making, particularly in low-power regimes. We propose Bayesian Hybrid Shrinkage (BHS), an empirical Bayes (EB) framework that leverages data-driven priors to mitigate selection bias and provides accurate uncertainty quantification. Unlike traditional EB methods that apply uniform shrinkage, BHS introduces an experiment-specific "local" shrinkage factor that incorporates individual experiment characteristics, improving robustness against prior misspecification. We also derive a closed-form inference strategy designed for high-throughput production environments. Extensive simulations and real-world evaluations at Meta Platforms demonstrate that BHS outperforms existing methods in terms of bias reduction and interval coverage, even under substantial violations of modeling assumptions.
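To make the shrinkage idea concrete, here is a minimal sketch of empirical Bayes shrinkage with a per-experiment ("local") factor in a normal-normal model. This is an illustrative simplification, not the authors' BHS method or its closed-form inference: the function `eb_local_shrinkage`, the method-of-moments prior estimate, and all variable names are assumptions for exposition. Each raw effect estimate is pulled toward the grand mean, more strongly when its standard error is large relative to the estimated between-experiment variance — exactly the low-power regime where the winner's curse bites hardest.

```python
def eb_local_shrinkage(estimates, std_errors):
    """Illustrative empirical Bayes shrinkage with local factors.

    Hypothetical sketch (not the paper's BHS): each estimate x_i with
    standard error s_i is shrunk toward the grand mean mu by a factor
    lambda_i = s_i^2 / (s_i^2 + tau^2), so noisier experiments are
    shrunk more aggressively.
    """
    n = len(estimates)
    mu = sum(estimates) / n
    # Method-of-moments prior variance: observed spread minus average noise.
    sample_var = sum((x - mu) ** 2 for x in estimates) / (n - 1)
    mean_noise = sum(s ** 2 for s in std_errors) / n
    tau2 = max(0.0, sample_var - mean_noise)
    shrunk = []
    for x, s in zip(estimates, std_errors):
        # If tau2 == 0, all signal is noise: shrink fully to the mean.
        lam = s ** 2 / (s ** 2 + tau2) if tau2 > 0 else 1.0
        shrunk.append(lam * mu + (1.0 - lam) * x)
    return shrunk
```

An experiment "selected for launch" because its raw estimate was the largest (e.g. the 10.0 below) is shrunk back toward the population mean, which is the mechanism by which shrinkage counteracts selection bias.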