🤖 AI Summary
In industrial experimentation platforms, biased variance estimation distorts confidence intervals and undermines statistical reliability. Conventional A/A tests diagnose such issues solely through the binary false positive rate (FPR), discarding effect-size information and suffering from low sample efficiency and limited sensitivity. This paper proposes a variance diagnostic based on the *t*² statistic: within the A/A testing framework, *t*² replaces the FPR as the primary monitoring metric, preserving the magnitude of the underlying test statistic. Theoretical analysis and empirical evaluation show that, at equal sample sizes, the *t*²-based approach is more sensitive to variance bias and noise, achieving substantially higher relative efficiency than FPR-based diagnostics. It therefore detects variance estimation failures earlier and more robustly, improving the statistical credibility and monitoring effectiveness of large-scale experimentation platforms.
📝 Abstract
Experimentation platforms in industry must often deal with customer trust issues. Platforms must prove the validity of their claims as well as catch issues that arise. Confidence intervals are a central quantity estimated by experimentation platforms, so their validity is of particular concern. To ensure confidence intervals are reliable, we must understand and diagnose when our variance estimates are biased or noisy, and when the resulting confidence intervals may be incorrect.
A common method for this is A/A testing, in which both the control and test arms receive the same treatment. One can then test whether the empirical false positive rate (FPR) over many such tests deviates substantially from the target FPR. However, this approach reduces each A/A test to a binary random variable, yielding an inefficient estimate because it throws away information about the magnitude of each experiment's result. We show how to empirically evaluate the effectiveness of statistics that monitor the variance estimates underpinning a platform's statistical reliability, and we show that statistics other than the empirical FPR are more effective at detecting issues. In particular, we propose a $t^2$-statistic that is more sample efficient.
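The contrast between the two diagnostics can be illustrated with a small simulation (a sketch, not the paper's implementation; the function names and the `var_bias` parameter are illustrative). Both arms draw from the same distribution, a `var_bias` factor below 1 mimics a variance estimator that understates the true variance, and we compare the empirical FPR against the mean $t^2$, which should sit near 1 when the variance estimate is correct.

```python
import numpy as np

rng = np.random.default_rng(0)

def aa_t_stats(n_tests, n_per_arm, var_bias=1.0):
    """Simulate A/A tests: both arms draw from N(0, 1).

    var_bias < 1 mimics an estimator that understates the variance,
    which inflates the resulting t-statistics.
    """
    t_stats = np.empty(n_tests)
    for i in range(n_tests):
        a = rng.normal(size=n_per_arm)
        b = rng.normal(size=n_per_arm)
        diff = a.mean() - b.mean()
        # (possibly biased) variance estimate of the difference in means
        se2 = var_bias * (a.var(ddof=1) + b.var(ddof=1)) / n_per_arm
        t_stats[i] = diff / np.sqrt(se2)
    return t_stats

t = aa_t_stats(n_tests=2000, n_per_arm=500, var_bias=0.8)

# FPR diagnostic: each test contributes only a binary outcome |t| > 1.96
fpr = np.mean(np.abs(t) > 1.96)

# t^2 diagnostic: mean of t^2, expected to be ~1 under a correct variance
t2 = np.mean(t ** 2)

print(f"empirical FPR: {fpr:.3f} (target 0.05)")
print(f"mean t^2:      {t2:.3f} (target ~1)")
```

With the variance understated by 20%, both diagnostics drift from their targets (FPR above 0.05, mean $t^2$ above 1), but the $t^2$ statistic retains each test's magnitude rather than a single bit, which is the source of its sample efficiency.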