Policy Testing in Markov Decision Processes

📅 2025-05-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the problem of hypothesis testing for policy values in discounted Markov decision processes (MDPs): given a fixed confidence level, determine with as few samples as possible whether a given policy's value exceeds a specified threshold. To overcome the computational intractability of the conventional non-convex lower-bound optimization, we reformulate the problem as a policy optimization task subject to convex constraints, revealing its equivalence to policy gradient optimization in a reversed MDP. This unifies statistical optimality with computational feasibility. Based on this insight, we propose the first pure-exploration policy testing algorithm that matches the instance-dependent lower bound and admits an efficient implementation. Numerical experiments demonstrate that our algorithm significantly outperforms existing baselines in sample efficiency.

📝 Abstract
We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a specified threshold while minimizing the number of observations. We begin by deriving an instance-specific lower bound that any algorithm must satisfy. This lower bound is characterized as the solution to an optimization problem with non-convex constraints. We propose a policy testing algorithm inspired by this optimization problem, a common approach in pure exploration problems such as best-arm identification, where asymptotically optimal algorithms often stem from such optimization-based characterizations. As for other pure exploration tasks in MDPs, however, the non-convex constraints in the lower-bound problem present significant challenges, raising doubts about whether statistically optimal and computationally tractable algorithms can be designed. To address this, we reformulate the lower-bound problem by interchanging the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. Strikingly, this reformulated problem admits an interpretation as a policy optimization task in a newly constructed reversed MDP. Leveraging recent advances in policy gradient methods, we efficiently solve this problem and use it to design a policy testing algorithm that is statistically optimal, matching the instance-specific lower bound on sample complexity, while remaining computationally tractable. We validate our approach with numerical experiments.
Problem

Research questions and friction points this paper is trying to address.

Determine whether a policy's value exceeds a threshold with minimal samples
Derive an instance-specific lower bound on the number of observations
Design a statistically optimal yet computationally tractable testing algorithm
Innovation

Methods, ideas, or system contributions that make the work stand out.

Derives an instance-specific lower bound via an optimization characterization
Reformulates the lower-bound problem with convex constraints for tractability
Solves the reformulation with policy gradient methods in a reversed MDP
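Schematically, the objective–constraint interchange described above can be sketched as follows; the symbols and functional forms here are generic placeholders standing in for the paper's actual characterization, which is not reproduced on this page.

```latex
% Lower-bound problem: minimize an allocation cost c(w) over the simplex,
% subject to a non-convex identifiability constraint g(w) >= 1.
\min_{w \in \Delta} \; c(w)
  \quad \text{s.t.} \quad g(w) \ge 1 \quad \text{(non-convex constraint)}
% Interchanging the roles of objective and constraints gives an
% equivalent problem with a non-convex objective but convex constraints:
\max_{w} \; g(w)
  \quad \text{s.t.} \quad w \in \mathcal{C} \quad \text{(convex set)},
% which admits an interpretation as policy optimization in a reversed MDP
% and can be attacked with policy gradient methods.
```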