A Two-armed Bandit Framework for A/B Testing

📅 2025-07-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional A/B testing methods suffer from low statistical power and poor detection capability under small sample sizes. To address this, we propose a novel two-armed bandit testing framework that integrates causal inference with reinforcement learning. Our approach innovatively combines doubly robust estimation with dynamic bandit policies, enabling efficient exploration via pseudo-outcome construction, and employs permutation testing for exact p-value computation. We establish theoretical consistency of the proposed test statistic. Extensive simulations and experiments on real-world ride-hailing data demonstrate that, at equal sample sizes, our method achieves significantly higher statistical power—averaging a 23.6% improvement over state-of-the-art baselines—while enhancing the accuracy of policy effect comparisons and decision reliability. The framework is particularly effective in small-sample, high-variance settings, offering a principled solution for robust online experimentation under limited data.

Technology Category

Application Category

📝 Abstract
A/B testing is widely used in modern technology companies for policy evaluation and product deployment, with the goal of comparing the outcomes under a newly-developed policy against a standard control. Various causal inference and reinforcement learning methods developed in the literature are applicable to A/B testing. This paper introduces a two-armed bandit framework designed to improve the power of existing approaches. The proposed procedure consists of three main steps: (i) employing doubly robust estimation to generate pseudo-outcomes, (ii) utilizing a two-armed bandit framework to construct the test statistic, and (iii) applying a permutation-based method to compute the $p$-value. We demonstrate the efficacy of the proposed method through asymptotic theories, numerical experiments and real-world data from a ridesharing company, showing its superior performance in comparison to existing methods.
Problem

Research questions and friction points this paper is trying to address.

Improves A/B testing power using two-armed bandit framework
Compares new policy outcomes against standard control efficiently
Enhances causal inference with doubly robust estimation and permutation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Doubly robust estimation for pseudo-outcomes
Two-armed bandit test statistic construction
Permutation-based p-value computation method
🔎 Similar Papers
J
Jinjuan Wang
School of Mathematics and Statistics, Beijing Institute of Technology
Q
Qianglin Wen
Yunnan Key Laboratory of Statistical Modeling and Data Analysis, Yunnan University
Y
Yu Zhang
Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, China
Xiaodong Yan
Xiaodong Yan
Unknown affiliation
统计学,机器学习
Chengchun Shi
Chengchun Shi
London School of Economics and Political Science
Large Language ModelsReinforcement LearningStatistics