🤖 AI Summary
Traditional A/B testing methods suffer from low statistical power under small sample sizes. To address this, we propose a novel two-armed bandit testing framework that integrates causal inference with reinforcement learning. Our approach combines doubly robust estimation with dynamic bandit policies, enabling efficient exploration via pseudo-outcome construction, and employs permutation testing for exact $p$-value computation. We establish the theoretical consistency of the proposed test statistic. Extensive simulations and experiments on real-world ride-hailing data demonstrate that, at equal sample sizes, our method achieves significantly higher statistical power (averaging a 23.6% improvement over state-of-the-art baselines) while improving the accuracy of policy-effect comparisons and the reliability of the resulting decisions. The framework is particularly effective in small-sample, high-variance settings, offering a principled solution for robust online experimentation under limited data.
📝 Abstract
A/B testing is widely used in modern technology companies for policy evaluation and product deployment, with the goal of comparing the outcomes under a newly developed policy against a standard control. Various causal inference and reinforcement learning methods developed in the literature are applicable to A/B testing. This paper introduces a two-armed bandit framework designed to improve the power of existing approaches. The proposed procedure consists of three main steps: (i) employing doubly robust estimation to generate pseudo-outcomes, (ii) utilizing a two-armed bandit framework to construct the test statistic, and (iii) applying a permutation-based method to compute the $p$-value. We demonstrate the efficacy of the proposed method through asymptotic theory, numerical experiments, and real-world data from a ridesharing company, showing its superior performance compared to existing methods.
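To make steps (i) and (iii) concrete, here is a minimal sketch, not the paper's exact procedure: it assumes a binary treatment with known propensities, correctly specified outcome models, and a sign-flip variant of the permutation test; all function names and the simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def doubly_robust_pseudo_outcomes(y, a, propensity, mu1, mu0):
    """Step (i): doubly robust (AIPW-style) pseudo-outcomes for the effect.

    y: observed outcomes; a: binary treatment indicator;
    propensity: P(A = 1 | X); mu1, mu0: outcome-model predictions.
    """
    return (mu1 - mu0
            + a * (y - mu1) / propensity
            - (1 - a) * (y - mu0) / (1 - propensity))

def permutation_pvalue(pseudo, n_perm=2000, rng=rng):
    """Step (iii): permutation p-value for H0: zero average effect.

    Uses the absolute mean of the pseudo-outcomes as the test statistic
    and random sign flips to approximate the null distribution.
    """
    stat = abs(pseudo.mean())
    flips = rng.choice([-1.0, 1.0], size=(n_perm, pseudo.size))
    null_stats = np.abs((flips * pseudo).mean(axis=1))
    return (1 + np.sum(null_stats >= stat)) / (1 + n_perm)

# Illustrative simulation (not the paper's data): constant 0.5 propensity,
# exact outcome models, and a true treatment effect of 1.
n = 500
a = rng.integers(0, 2, size=n).astype(float)
y = a * 1.0 + rng.normal(size=n)
pseudo = doubly_robust_pseudo_outcomes(y, a, np.full(n, 0.5),
                                       np.ones(n), np.zeros(n))
p_value = permutation_pvalue(pseudo)
```

Step (ii), the bandit component, is omitted here: in the paper it adaptively governs how samples are allocated between the two arms before the test statistic is formed.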