🤖 AI Summary
This paper addresses **return distribution estimation** in off-policy evaluation (OPE), i.e., accurately estimating the full return distribution of a target policy from offline data generated by a behavior policy. To this end, the paper proposes the **Energy Bellman Residual Minimizer (EBRM)**, which replaces the hard-to-estimate supremum-extended statistical distances with computationally tractable **expectation-extended statistical distances**, and provides a rigorous theoretical justification of their validity for learning the return distribution. Notably, the guarantees of EBRM do not require the completeness assumption. The paper further introduces a **multi-step extension**, which tightens the error bound in non-realizable settings. The theoretical analysis establishes finite-sample error bounds for EBRM in both realizable and non-realizable regimes, balancing statistical efficiency against modeling robustness. Collectively, this work delivers a more practical and robust theoretical and algorithmic framework for distributional OPE.
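For intuition, a supremum-extended distance measures the worst-case discrepancy between conditional return distributions over state–action pairs, whereas an expectation-extended distance averages that discrepancy under a reference distribution. The notation below is an illustrative sketch of this contrast, not the paper's exact definitions:

```latex
% Illustrative notation (not verbatim from the paper):
% supremum-extended vs. expectation-extended distance between two
% conditional return distributions \eta_1, \eta_2, built from a base
% distance d between distributions (e.g., Wasserstein or energy distance).
d_{\sup}(\eta_1, \eta_2) = \sup_{(s,a)} d\big(\eta_1(\cdot \mid s,a),\, \eta_2(\cdot \mid s,a)\big),
\qquad
d_{\mathbb{E}}(\eta_1, \eta_2) = \mathbb{E}_{(s,a)\sim\mu}\Big[ d\big(\eta_1(\cdot \mid s,a),\, \eta_2(\cdot \mid s,a)\big) \Big].
% A Bellman-residual objective in this style would minimize
% d_{\mathbb{E}}(\eta, \mathcal{T}^{\pi}\eta), where \mathcal{T}^{\pi} is the
% distributional Bellman operator of the target policy \pi and \mu is a
% reference state–action distribution (e.g., induced by the behavior policy).
```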
📝 Abstract
We study distributional off-policy evaluation (OPE), whose goal is to learn the distribution of the return of a target policy using offline data generated by a different policy. The theoretical foundation of many existing works relies on supremum-extended statistical distances, such as the supremum-Wasserstein distance, which are hard to estimate. In contrast, we study the more manageable expectation-extended statistical distances and provide a novel theoretical justification of their validity for learning the return distribution. Based on this attractive property, we propose a new method called the Energy Bellman Residual Minimizer (EBRM) for distributional OPE, and provide corresponding in-depth theoretical analyses. We establish a finite-sample error bound for the EBRM estimator under the realizability assumption. Furthermore, we introduce a variant of our method based on a multi-step extension, which improves the error bound for non-realizable settings. Notably, unlike prior distributional OPE methods, the theoretical guarantees of our method do not require the completeness assumption.
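The abstract does not spell out implementation details, but a minimal sketch of how an expectation-extended (energy-distance) Bellman residual could be minimized from offline transitions might look as follows. The generator-style parameterization, the detached bootstrap target, and all names (`ReturnSampler`, `energy_distance`, `ebrm_style_loss`, `target_policy`) are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical sketch of an energy-distance Bellman residual objective for
# distributional OPE, in the spirit of EBRM. Names and network shapes are
# assumptions made for illustration only.
import torch
import torch.nn as nn

class ReturnSampler(nn.Module):
    """Maps (state, action, noise) to a sample of the return distribution."""
    def __init__(self, state_dim, action_dim, noise_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + noise_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        self.noise_dim = noise_dim

    def forward(self, s, a, n_samples):
        # Draw n_samples return samples per (s, a) pair via reparameterized noise.
        batch = s.shape[0]
        sa = torch.cat([s, a], dim=-1)                     # (batch, s_dim + a_dim)
        sa = sa.unsqueeze(1).expand(-1, n_samples, -1)     # (batch, n, s_dim + a_dim)
        z = torch.randn(batch, n_samples, self.noise_dim)  # (batch, n, noise_dim)
        return self.net(torch.cat([sa, z], dim=-1)).squeeze(-1)  # (batch, n)

def energy_distance(x, y):
    """Sample-based energy distance between two sets of 1-D return samples,
    each of shape (batch, n). Averaging over the batch makes this an
    *expectation-extended* discrepancy rather than a supremum."""
    d_xy = (x.unsqueeze(-1) - y.unsqueeze(-2)).abs().mean(dim=(-1, -2))
    d_xx = (x.unsqueeze(-1) - x.unsqueeze(-2)).abs().mean(dim=(-1, -2))
    d_yy = (y.unsqueeze(-1) - y.unsqueeze(-2)).abs().mean(dim=(-1, -2))
    return (2 * d_xy - d_xx - d_yy).mean()

def ebrm_style_loss(model, batch, target_policy, gamma=0.99, n_samples=32):
    """One-step distributional Bellman residual under the energy distance.
    `batch` holds offline transitions (s, a, r, s') from the behavior policy;
    `target_policy(s')` returns actions of the target policy (assumed)."""
    s, a, r, s_next = batch
    a_next = target_policy(s_next)
    pred = model(s, a, n_samples)                          # samples of Z(s, a)
    with torch.no_grad():                                  # detached bootstrap target
        target = r.unsqueeze(-1) + gamma * model(s_next, a_next, n_samples)
    return energy_distance(pred, target)
```

In practice one would iterate over minibatches of offline transitions and update `model` with a standard optimizer; detaching the bootstrap target is a design choice of this sketch, not a detail stated in the abstract.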