🤖 AI Summary
This work addresses the suboptimal variance control and lack of theoretical guarantees in the self-normalized inverse propensity score (SNIPS) estimator for off-policy evaluation (OPE). To overcome these limitations, the authors propose the β*-IPS estimator, which replaces the conventional multiplicative self-normalization with an optimal additive control variate (i.e., a baseline correction). Theoretical analysis establishes, for the first time, that the optimal additive baseline strictly dominates SNIPS in asymptotic mean squared error. Moreover, the study shows that SNIPS is asymptotically equivalent to employing a specific, but generally suboptimal, additive baseline. This contribution provides a theoretically grounded and empirically effective estimation method for OPE in recommendation and ranking systems.
📝 Abstract
Off-policy evaluation (OPE) is essential for assessing ranking and recommendation systems without costly online interventions. Self-Normalised Inverse Propensity Scoring (SNIPS) is a standard tool for variance reduction in OPE, leveraging a multiplicative control variate. Recent advances in off-policy learning suggest that additive control variates (baseline corrections) may offer superior performance, yet theoretical guarantees for evaluation are lacking. This paper provides a definitive answer to that open question: we prove that $\beta^\star$-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in mean squared error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific -- but generally sub-optimal -- additive baseline. Our results theoretically justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.
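The contrast between the two control variates can be sketched on synthetic logged-bandit data. The snippet below is a minimal illustration, not the paper's implementation: the two-action bandit, the policies, and the variable names are all hypothetical. It compares vanilla IPS, SNIPS (multiplicative self-normalisation), and an additive baseline-corrected estimator whose coefficient is the standard variance-minimising control-variate choice $\beta^\star = \mathrm{Cov}(wr, w)/\mathrm{Var}(w)$, where $w$ denotes the importance weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical two-action bandit: uniform logging policy,
# target policy that prefers action 1.
p_log = np.array([0.5, 0.5])
p_tgt = np.array([0.2, 0.8])
reward_prob = np.array([0.3, 0.7])  # expected reward of each action

actions = rng.choice(2, size=n, p=p_log)
rewards = rng.binomial(1, reward_prob[actions]).astype(float)
w = p_tgt[actions] / p_log[actions]  # importance weights, E[w] = 1

true_value = p_tgt @ reward_prob  # ground-truth policy value: 0.62

# Vanilla IPS: unbiased but high-variance.
ips = np.mean(w * rewards)

# SNIPS: multiplicative control variate (divide by the mean weight).
snips = np.sum(w * rewards) / np.sum(w)

# Additive control variate: (w - 1) has mean zero, so for any beta
#   mean(w * r - beta * (w - 1))
# stays unbiased. The variance-minimising coefficient is
#   beta* = Cov(w*r, w) / Var(w)
# (standard control-variate result; estimated here from the sample).
beta_star = np.cov(w * rewards, w, ddof=1)[0, 1] / np.var(w, ddof=1)
beta_ips = np.mean(w * rewards - beta_star * (w - 1))

# Per-sample variance of the averaged terms: the baseline-corrected
# terms have strictly lower variance than the raw IPS terms.
var_ips = np.var(w * rewards)
var_beta = np.var(w * rewards - beta_star * (w - 1))
```

All three point estimates land near the true policy value at this sample size; the difference the paper analyses is in their (asymptotic) variance, which `var_beta < var_ips` makes visible for the additive correction.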