Bayesian Off-Policy Evaluation and Learning for Large Action Spaces

📅 2024-02-22

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

To address low sample efficiency in off-policy evaluation (OPE) and off-policy learning (OPL) under large action spaces, this paper proposes sDM, a generic Bayesian approach built on a unified framework that uses structured, informative priors to capture correlations between actions. Inspired by online Bayesian bandits, the paper also introduces Bayesian metrics that assess an algorithm's average performance across problem instances, in place of conventional worst-case analysis. The analysis of sDM in OPE and OPL highlights the benefits of leveraging action correlations, which sDM exploits without compromising computational efficiency. Empirically, the paper reports strong performance for sDM.

📝 Abstract

In interactive systems, actions are often correlated, presenting an opportunity for more sample-efficient off-policy evaluation (OPE) and learning (OPL) in large action spaces. We introduce a unified Bayesian framework to capture these correlations through structured and informative priors. In this framework, we propose sDM, a generic Bayesian approach for OPE and OPL, grounded in both algorithmic and theoretical foundations. Notably, sDM leverages action correlations without compromising computational efficiency. Moreover, inspired by online Bayesian bandits, we introduce Bayesian metrics that assess the average performance of algorithms across multiple problem instances, deviating from the conventional worst-case assessments. We analyze sDM in OPE and OPL, highlighting the benefits of leveraging action correlations. Empirical evidence showcases the strong performance of sDM.
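The idea of sharing information across correlated actions can be illustrated with a toy example. The sketch below is not the paper's sDM algorithm: it assumes a non-contextual setting, a zero-mean joint Gaussian prior over per-action mean rewards, and Gaussian reward noise with known variance. The function names (`gaussian_posterior`, `dm_value`) and all numbers are hypothetical, chosen only to show how a structured prior lets logged data on some actions inform the estimate for an action that was never logged.

```python
import numpy as np


def gaussian_posterior(prior_mean, prior_cov, actions, rewards, noise_var):
    """Posterior over per-action mean rewards under a joint Gaussian prior.

    Assumes rewards are Gaussian around the chosen action's mean with
    known variance `noise_var` (an assumption made for this sketch).
    """
    k = prior_mean.shape[0]
    # Sufficient statistics of the logged data: pull counts and reward sums.
    counts = np.bincount(actions, minlength=k).astype(float)
    reward_sums = np.bincount(actions, weights=rewards, minlength=k)
    prior_prec = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(prior_prec + np.diag(counts) / noise_var)
    post_mean = post_cov @ (prior_prec @ prior_mean + reward_sums / noise_var)
    return post_mean, post_cov


def dm_value(target_policy, post_mean):
    """Direct-method style estimate: expected posterior-mean reward under the target policy."""
    return float(target_policy @ post_mean)


# Hypothetical logged data: actions 0 and 1 were taken, action 2 never was.
actions = np.array([0, 0, 1, 1])
rewards = np.array([1.0, 1.2, 0.8, 1.0])
noise_var = 1.0
prior_mean = np.zeros(3)

# Structured prior: a shared latent factor makes all actions positively correlated.
loadings = np.ones((3, 1))
structured_cov = loadings @ loadings.T + 0.1 * np.eye(3)
# Independent prior with the same marginal variances, for comparison.
independent_cov = np.diag(np.diag(structured_cov))

# Target policy that always plays the unlogged action 2.
target_policy = np.array([0.0, 0.0, 1.0])

mean_structured, _ = gaussian_posterior(prior_mean, structured_cov, actions, rewards, noise_var)
mean_independent, _ = gaussian_posterior(prior_mean, independent_cov, actions, rewards, noise_var)

value_structured = dm_value(target_policy, mean_structured)
value_independent = dm_value(target_policy, mean_independent)
print(value_structured, value_independent)
```

Under the independent prior, the estimate for the unlogged action stays at its prior mean of zero, because no logged sample touches it. Under the structured prior, the rewards observed for actions 0 and 1 pull the estimate for action 2 upward. This is the kind of gain the abstract attributes to leveraging action correlations, shown here only in a deliberately simplified form.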

Problem

Research questions and friction points this paper is trying to address.

Enhances off-policy evaluation in large action spaces

Leverages action correlations for efficient learning

Introduces Bayesian metrics for average performance assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian framework with structured priors

sDM leverages action correlations efficiently

Bayesian metrics for average performance assessment

🔎 Similar Papers

Off-policy Evaluation with Deeply-abstracted States