🤖 AI Summary
Existing off-policy evaluation (OPE) methods that leverage auxiliary data lack rigorous uncertainty quantification, limiting their reliability in high-stakes domains such as healthcare. This paper establishes the first theoretically grounded confidence intervals for policy value estimation in Markov decision processes with high-dimensional states. First, for a single initial state, we construct conditional confidence intervals via conformal prediction. Second, for estimating the average performance over many initial states, we combine doubly robust estimation with prediction-powered inference to obtain robust interval estimates. Our approach unifies generative-model data augmentation, high-dimensional modeling, and conformal inference. We validate it on robotic-control and inventory-management simulations, as well as real-world MIMIC-IV clinical data. Empirical results demonstrate significantly improved coverage and precision compared to state-of-the-art OPE methods, advancing trustworthy policy evaluation in safety-critical applications.
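The second construction above (debiasing a cheap but possibly biased predictor with a small set of gold-standard returns) can be illustrated with a minimal prediction-powered-inference mean estimate. Everything below is a hypothetical stand-in with synthetic numbers, not the paper's doubly robust estimator:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a small set of "gold" policy returns, and a cheap
# (possibly biased) predictor evaluated both on that set and on a large
# pool of unlabeled initial states.
n_small, n_large = 200, 10_000
y = rng.normal(loc=1.0, scale=1.0, size=n_small)            # gold returns
f_small = y + 0.3 + rng.normal(scale=0.2, size=n_small)     # biased predictions, labeled set
f_large = rng.normal(loc=1.3, scale=1.0, size=n_large)      # predictions, unlabeled pool

# Point estimate: large-pool prediction mean, corrected by the
# labeled-set rectifier (average prediction error).
rectifier = np.mean(f_small - y)
theta_hat = np.mean(f_large) - rectifier

# Normal-approximation 90% confidence interval: the two sampling
# variances add because the two samples are independent.
var = (np.var(f_large, ddof=1) / n_large
       + np.var(f_small - y, ddof=1) / n_small)
z = 1.6448536269514722  # standard normal quantile at 0.95
ci = (theta_hat - z * np.sqrt(var), theta_hat + z * np.sqrt(var))
```

The key design point is that the bias of the predictor cancels in expectation via the rectifier, so even a systematically off generative model yields a valid interval as long as the small gold sample is unbiased.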
📝 Abstract
Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of these value estimates. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation for OPE in RL lack principled uncertainty quantification. In high-stakes settings like healthcare, reliable uncertainty estimates are important for comparing policy value estimates. In this work, we propose two approaches to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over the policy performance conditioned on a particular initial state, $V^\pi(s_0)$; such intervals are particularly important for human-centered applications. To do so, we introduce a new conformal prediction method for high-dimensional state MDPs. Second, we consider the more common task of estimating the average policy performance over many initial states; to do so, we draw on ideas from doubly robust estimation and prediction-powered inference. Across simulators spanning robotics, healthcare, and inventory management, and a real healthcare dataset from MIMIC-IV, we find that our methods can use augmented data and still consistently produce intervals that cover the ground-truth values, unlike previously proposed methods.
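The first approach (intervals on $V^\pi(s_0)$ for a particular initial state) can be sketched with plain split conformal prediction. The data, the least-squares value regressor, and all names below are hypothetical stand-ins for illustration; the paper's conditional method for high-dimensional MDPs is more involved:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: initial-state features and Monte Carlo rollout
# returns under the target policy (hypothetical data, for illustration).
n, d = 500, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)

# Split conformal: fit a value predictor on one half of the data,
# calibrate residual scores on the other half.
n_fit = n // 2
X_fit, y_fit = X[:n_fit], y[:n_fit]
X_cal, y_cal = X[n_fit:], y[n_fit:]

# Simple least-squares predictor (stand-in for any regressor).
beta, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)
predict = lambda X: X @ beta

# Conformal quantile of absolute residuals at miscoverage level alpha,
# with the standard finite-sample (n+1) correction.
alpha = 0.1
scores = np.abs(y_cal - predict(X_cal))
n_cal = len(scores)
level = min(1.0, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)
q = np.quantile(scores, level, method="higher")

# Marginal 90% interval for the value at a new initial state s0.
s0 = rng.normal(size=d)
v_hat = float(predict(s0[None, :])[0])
interval = (v_hat - q, v_hat + q)
```

Note this sketch only gives marginal (average-over-states) coverage; conditioning on a specific $s_0$, as the abstract describes, requires the stronger construction developed in the paper.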