Adaptive Exploration for Multi-Reward Multi-Policy Evaluation

📅 2025-02-04

📈 Citations: 0

✨ Influential: 0

career value

225K/year

🤖 AI Summary

This work addresses the online multi-reward, multi-policy joint evaluation problem—simultaneously estimating, with high confidence, the discounted value of multiple policies under multiple reward functions—a setting not systematically studied in existing PAC reinforcement learning literature. We first derive an instance-dependent lower bound on sample complexity. Then, we propose a scalable algorithm based on convex approximation and innovatively adapt the MR-NaS adaptive sampling mechanism to policy evaluation. Theoretically, we prove that our method achieves (ε,δ)-PAC guarantees in tabular MDPs while substantially reducing sample complexity compared to prior approaches. Empirically, it outperforms state-of-the-art baselines in both estimation accuracy and computational efficiency across diverse benchmark tasks.

Technology Category

Application Category

📝 Abstract

We study the policy evaluation problem in an online multi-reward multi-policy discounted setting, where multiple reward functions must be evaluated simultaneously for different policies. We adopt an $(epsilon,delta)$-PAC perspective to achieve $epsilon$-accurate estimates with high confidence across finite or convex sets of rewards, a setting that has not been investigated in the literature. Building on prior work on Multi-Reward Best Policy Identification, we adapt the MR-NaS exploration scheme to jointly minimize sample complexity for evaluating different policies across different reward sets. Our approach leverages an instance-specific lower bound revealing how the sample complexity scales with a measure of value deviation, guiding the design of an efficient exploration policy. Although computing this bound entails a hard non-convex optimization, we propose an efficient convex approximation that holds for both finite and convex reward sets. Experiments in tabular domains demonstrate the effectiveness of this adaptive exploration scheme.

Problem

Research questions and friction points this paper is trying to address.

Evaluate multiple reward functions simultaneously for different policies.

Achieve accurate estimates with high confidence across reward sets.

Minimize sample complexity for evaluating different policies efficiently.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive exploration scheme

Convex approximation method

Multi-reward policy evaluation

🔎 Similar Papers

Efficient Multi-Policy Evaluation for Reinforcement Learning