PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data

📅 2025-07-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing off-policy evaluation (OPE) methods rely on auxiliary data but lack rigorous uncertainty quantification, limiting their reliability in high-stakes domains such as healthcare. This paper establishes the first theoretically grounded confidence intervals for policy value estimation in high-dimensional state Markov decision processes. First, for a single initial state, we propose a conditional confidence interval construction method based on conformal prediction. Second, for estimating the average performance over multiple initial states, we integrate doubly robust estimation with prediction-powered inference to achieve robust interval estimation. Our approach unifies generative model augmentation, high-dimensional modeling, and conformal inference. We validate it on robotic control and inventory management simulations, as well as real-world MIMIC-IV clinical data. Empirical results demonstrate significantly improved coverage and precision compared to state-of-the-art OPE methods, advancing trustworthy policy evaluation in safety-critical applications.

📝 Abstract
Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of these value estimates. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation for OPE in RL lack principled uncertainty quantification. In high-stakes settings like healthcare, reliable uncertainty estimates are important for comparing policy value estimates. In this work, we propose two approaches to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over the policy performance conditioned on a particular initial state, $V^\pi(s_0)$; such intervals are particularly important for human-centered applications. To do so, we introduce a new conformal prediction method for high-dimensional state MDPs. Second, we consider the more common task of estimating the average policy performance over many initial states; to do so, we draw on ideas from doubly robust estimation and prediction-powered inference. Across simulators spanning robotics, healthcare, and inventory management, and a real healthcare dataset from MIMIC-IV, we find that our methods can use augmented data and still consistently produce intervals that cover the ground-truth values, unlike previously proposed methods.
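The conditional-interval idea in the abstract can be sketched with plain split conformal prediction: score a held-out calibration set by absolute residual, take a finite-sample-corrected quantile, and widen the point estimate by that amount. This is a minimal illustration, not the paper's method; `predict` and `calib` are hypothetical stand-ins for a fitted value model and calibration rollouts.

```python
import math

def conformal_interval(calib_pairs, predict, s0, alpha=0.1):
    """(1 - alpha) split-conformal interval around predict(s0).

    calib_pairs: held-out (state, observed_return) pairs.
    predict: any point estimator of the value, e.g. a fitted model.
    """
    # Nonconformity scores: absolute residuals on the calibration set.
    scores = sorted(abs(y - predict(s)) for s, y in calib_pairs)
    n = len(scores)
    # Finite-sample-corrected quantile index: ceil((n+1)(1-alpha)) - 1.
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = scores[k]
    v_hat = predict(s0)
    return v_hat - q, v_hat + q

# Toy usage: a stand-in value model and four calibration pairs,
# each off by exactly 0.5, so the interval half-width is 0.5.
predict = lambda s: 2.0 * s
calib = [(1.0, 2.5), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
lo, hi = conformal_interval(calib, predict, s0=5.0)  # (9.5, 10.5)
```

Under exchangeability of the calibration pairs and the test point, intervals built this way cover the true value with probability at least 1 - alpha; the paper's contribution is making such a construction work for high-dimensional state MDPs.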
Problem

Research questions and friction points this paper is trying to address.

Estimating new RL policy value before deployment
Handling biased auxiliary datasets in OPE
Providing reliable uncertainty quantification for policy evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conformal prediction for high-dimensional MDPs
Doubly robust estimation for policy performance
Prediction-powered inference with augmented data
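The prediction-powered idea in the list above can be sketched for a simple mean: average the model's predictions over a large augmented set, then debias with the average prediction error (the "rectifier") measured on the small real set, where ground truth is known. A minimal sketch; the function and variable names are illustrative, not the paper's implementation.

```python
from statistics import mean

def ppi_mean(real_returns, preds_on_real, preds_on_augmented):
    """Prediction-powered estimate of an average policy value.

    real_returns: observed returns on the small real dataset.
    preds_on_real: model predictions for those same trajectories.
    preds_on_augmented: model predictions on the large augmented set.
    """
    # Rectifier: the model's average bias where truth is observable.
    rectifier = mean(y - f for y, f in zip(real_returns, preds_on_real))
    # Cheap prediction average, corrected by the measured bias.
    return mean(preds_on_augmented) + rectifier

# Toy usage: the model under-predicts by 0.5 on average, so the
# augmented-set average of 3.0 is corrected upward to 3.5.
est = ppi_mean([1.0, 2.0, 3.0], [0.5, 1.5, 2.5], [2.0, 4.0])  # 3.5
```

The doubly robust flavor comes from the same structure: if the model is accurate, the rectifier is near zero and the estimate inherits the augmented set's low variance; if the model is biased, the correction term restores validity.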
Authors

Aishwarya Mandyam
Stanford University
Jason Meng
Stanford University
Ge Gao
Stanford University
Jiankai Sun
Stanford University
Mac Schwager
Stanford University
Robotics, Control, Multi-Agent Systems, Machine Learning, Statistical Inference and Estimation
Barbara E. Engelhardt
The Gladstone Institutes, Stanford University
Emma Brunskill
Associate Professor of Computer Science, Stanford University
Reinforcement Learning, Machine Learning, Decision Making Under Uncertainty, Online Education