🤖 AI Summary
This work addresses how strongly the accuracy of off-policy evaluation (OPE) depends on the logging policy used to collect the data, a design choice for which systematic theoretical guidance has been lacking. For a given target policy, the study formally characterizes a fundamental trade-off between concentrating logging probability on high-reward actions and covering the actions the target policy may take. Drawing on statistical learning and decision theory, it introduces a unified optimization framework and derives optimal logging policies under three informational regimes: the target policy and reward distribution are (i) fully known, (ii) unknown, or (iii) partially known through priors or noisy estimates at logging time. The results give design guidelines for each regime and practical data collection principles for applications such as recommender systems, including cases where operational constraints prevent implementing the theoretical optimum, with the aim of improving OPE accuracy and sample efficiency.
📝 Abstract
Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy. It enables high-stakes experimentation without live deployment, yet in practice accuracy depends heavily on the logging policy used to collect data for computing the estimate. We study how to design logging policies that minimize OPE error for given target policies. We characterize a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems. We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective. We also distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.
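The reward-coverage tradeoff described above can be illustrated with a minimal sketch. It is not the paper's framework. It assumes a context-free setting with a handful of actions, the standard inverse propensity scoring (IPS) estimator, and reward means and variances that are known when the logging policy is chosen (the "known" regime). Under those assumptions, the standard importance-sampling argument gives a per-sample IPS variance of $\sum_a \pi(a)^2\,\mathbb{E}[R^2 \mid a] / \mu(a) - V(\pi)^2$, which is minimized by $\mu(a) \propto \pi(a)\sqrt{\mathbb{E}[R^2 \mid a]}$. The function names and the three-action example are hypothetical.

```python
import numpy as np


def ips_variance(pi, mu, mean_r, var_r):
    """Per-sample variance of the IPS estimate w(a) * R, with w(a) = pi(a) / mu(a).

    Returns inf when mu puts zero mass on an action the target policy takes
    (a coverage failure; IPS is also biased in that case).
    """
    pi, mu = np.asarray(pi, float), np.asarray(mu, float)
    mean_r, var_r = np.asarray(mean_r, float), np.asarray(var_r, float)
    second_moment = mean_r ** 2 + var_r      # E[R^2 | a]
    value = float(np.sum(pi * mean_r))       # true value of the target policy
    support = pi > 0
    if np.any(mu[support] <= 0):
        return float("inf")
    second = np.sum(pi[support] ** 2 * second_moment[support] / mu[support])
    return float(second - value ** 2)


def variance_minimizing_logging(pi, mean_r, var_r):
    """Logging policy with mu(a) proportional to pi(a) * sqrt(E[R^2 | a])."""
    pi = np.asarray(pi, float)
    second_moment = np.asarray(mean_r, float) ** 2 + np.asarray(var_r, float)
    weights = pi * np.sqrt(second_moment)
    return weights / weights.sum()


# Hypothetical three-action problem; all numbers are illustrative assumptions.
pi = np.array([0.7, 0.2, 0.1])       # target policy
mean_r = np.array([1.0, 0.5, 0.1])   # mean reward per action
var_r = np.array([0.1, 0.1, 0.1])    # reward noise variance per action

candidates = {
    "uniform": np.full(3, 1 / 3),
    "on_policy": pi,
    "greedy_on_best_action": np.array([1.0, 0.0, 0.0]),
    "variance_minimizing": variance_minimizing_logging(pi, mean_r, var_r),
}
variances = {
    name: ips_variance(pi, mu, mean_r, var_r) for name, mu in candidates.items()
}
for name, v in variances.items():
    print(f"{name:>22}: per-sample IPS variance = {v:.4f}")
```

In this toy setup, the logging policy that puts all mass on the highest-reward action has infinite variance because it never observes two actions the target policy takes. The variance-minimizing policy shifts mass toward high-reward actions while keeping every such action covered. When rewards or the target policy are unknown or only noisily estimated, this closed form no longer applies directly, and those are the regimes the abstract says the paper treats separately.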