🤖 AI Summary
To address the high sample complexity and estimation inaccuracy of off-policy evaluation (OPE) in high-dimensional state spaces, this paper introduces state abstraction into the OPE framework for the first time. We propose a backward-model-irrelevance condition tailored to OPE and construct a time-reversed Markov decision process (MDP) based on it. Building on this condition, we design an iterative deep abstraction algorithm that guarantees Fisher consistency of standard OPE estimators, such as (marginalized) importance sampling, in the abstracted state space. Theoretically, our method substantially reduces the sample complexity of OPE, and the abstraction procedure is agnostic to both the target policy and the environment dynamics. Our core contributions are threefold: (i) establishing the first theoretical foundation for OPE-oriented state abstraction; (ii) introducing a verifiable backward-irrelevance condition; and (iii) showing that statistical consistency and computational efficiency can be achieved simultaneously in OPE.
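To make the iterative construction concrete, the sketch below illustrates the composition pattern such a procedure follows: each step projects the current state representation into a smaller space, and the composition yields the deeply-abstracted state. This is a minimal illustration under stated assumptions, not the paper's algorithm; the maps `phi_1` and `phi_2` and the helper `deep_abstraction` are hypothetical placeholders (in the paper the projections are learned, not fixed coordinate selections).

```python
import numpy as np

def deep_abstraction(states, abstraction_steps):
    """Compose a sequence of abstraction maps phi_1, ..., phi_K,
    each projecting the current representation into a smaller space,
    yielding the deeply-abstracted state phi_K(...phi_1(s))."""
    abstracted = states
    for phi in abstraction_steps:
        abstracted = phi(abstracted)
    return abstracted

# Hypothetical example: two fixed projections that successively
# coarsen a 10-dimensional state down to 2 dimensions.
phi_1 = lambda s: s[..., :5]   # keep the first 5 coordinates
phi_2 = lambda s: s[..., :2]   # then keep the first 2
states = np.random.randn(100, 10)   # batch of offline states
z = deep_abstraction(states, [phi_1, phi_2])
print(z.shape)  # (100, 2)
```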
📝 Abstract
Off-policy evaluation (OPE) is crucial for assessing a target policy's impact offline before its deployment. However, achieving accurate OPE in large state spaces remains challenging. This paper studies state abstractions, originally designed for policy learning, in the context of OPE. Our contributions are three-fold: (i) We define a set of irrelevance conditions central to learning state abstractions for OPE, and derive a backward-model-irrelevance condition for achieving irrelevance in (marginalized) importance sampling ratios by constructing a time-reversed Markov decision process (MDP). (ii) We propose a novel iterative procedure that sequentially projects the original state space into a smaller space, resulting in a deeply-abstracted state, which substantially reduces the sample complexity of OPE arising from the high cardinality of the state space. (iii) We prove the Fisher consistency of various OPE estimators when applied to our proposed abstract state spaces.
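For readers unfamiliar with these estimators, here is a minimal sketch of sequential importance sampling computed on abstracted states, assuming an abstraction map is given; `phi`, `pi_target`, and `pi_behavior` are hypothetical stand-ins, not the paper's implementation. The backward-model-irrelevance condition is what licenses replacing the original states with `phi(s)` inside the ratios without biasing the estimate.

```python
import numpy as np

def sequential_is_estimate(trajs, pi_target, pi_behavior, phi, gamma=0.99):
    """Sequential importance-sampling OPE computed on abstracted states.

    trajs: list of trajectories, each a list of (state, action, reward).
    pi_target, pi_behavior: functions (abstract_state, action) -> probability.
    phi: state-abstraction map; for a backward-model-irrelevant phi the
    ratios below agree with those computed in the original state space.
    """
    values = []
    for traj in trajs:
        rho, value = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            z = phi(s)                                   # abstract the state
            rho *= pi_target(z, a) / pi_behavior(z, a)   # cumulative ratio
            value += (gamma ** t) * rho * r
        values.append(value)
    return float(np.mean(values))
```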