Doubly Optimal Policy Evaluation for Reinforcement Learning

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
High variance in policy evaluation for long-horizon reinforcement learning tasks—stemming from suboptimal behavior policies and baseline designs—severely limits sample efficiency. To address this, we propose a “doubly optimal” joint optimization framework that simultaneously optimizes both the behavior policy and the control variate baseline, achieving strict variance reduction while preserving unbiasedness. Our method integrates importance sampling, control variates, and policy gradients, and establishes a theoretical analysis grounded in the bias–variance trade-off. We prove that our estimator’s variance is strictly lower than that of all existing optimal estimators. Empirically, on multiple continuous-control benchmark tasks, our approach reduces estimation variance significantly under identical sample budgets, improves evaluation accuracy by over 40%, and achieves state-of-the-art (SOTA) performance.

Technology Category

Application Category

📝 Abstract
Policy evaluation estimates the performance of a policy by (1) collecting data from the environment and (2) processing raw data into a meaningful estimate. Due to the sequential nature of reinforcement learning, any improper data-collecting policy or data-processing method substantially deteriorates the variance of evaluation results over long time steps. Thus, policy evaluation often suffers from large variance and requires massive data to achieve the desired accuracy. In this work, we design an optimal combination of data-collecting policy and data-processing baseline. Theoretically, we prove our doubly optimal policy evaluation method is unbiased and guaranteed to have lower variance than previously best-performing methods. Empirically, compared with previous works, we show our method reduces variance substantially and achieves superior empirical performance.
Problem

Research questions and friction points this paper is trying to address.

Reduces variance in reinforcement learning policy evaluation
Combines optimal data-collecting policy and data-processing baseline
Ensures unbiased and lower variance than previous methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal data-collecting policy design
Optimal data-processing baseline integration
Unbiased, lower variance policy evaluation
🔎 Similar Papers
No similar papers found.