Which Experiences Are Influential for RL Agents? Efficiently Estimating The Influence of Experiences

📅 2024-05-23
🏛️ arXiv.org
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
In reinforcement learning, efficiently evaluating the influence of individual experience replay samples on policy performance remains challenging; conventional leave-one-out (LOO) estimation is computationally infeasible due to its requirement of repeated policy retraining. To address this, we propose Policy Iteration with Turn-over Dropout (PIToD), the first method that integrates policy iteration with experience-level zero-out dropout perturbations. Leveraging gradient sensitivity analysis and Monte Carlo approximation, PIToD enables interpretable, high-efficiency influence estimation without retraining. Empirically, PIToD achieves over 100× speedup versus LOO while preserving strong rank correlation (Spearman ρ > 0.92). When applied to filter out negatively influential transitions, PIToD improves the average performance of suboptimal agents by 37.2% across multiple benchmark tasks, markedly enhancing the quality of replay data utilization.
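To make the turn-over-dropout mechanism in the summary concrete, below is a minimal PyTorch sketch. It is an illustration under assumptions, not the authors' implementation: `MaskedQNet`, the mask density `DROP_P`, and the `score_fn` hook are all hypothetical. The idea is that each experience is tied to a fixed binary mask over a hidden layer, training on that experience updates only the masked subnetwork, and evaluating with the flipped mask approximates a policy that never saw it.

```python
# Minimal sketch of per-experience turn-over dropout (illustrative only).
import torch
import torch.nn as nn

HIDDEN = 256
DROP_P = 0.5  # assumed mask density; the paper's setting may differ


def make_mask(experience_id: int) -> torch.Tensor:
    """Deterministic binary mask for one experience, seeded by its id."""
    g = torch.Generator().manual_seed(experience_id)
    return (torch.rand(HIDDEN, generator=g) < DROP_P).float()


class MaskedQNet(nn.Module):
    """Q-network whose hidden layer is gated by a per-experience mask."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, HIDDEN)
        self.fc2 = nn.Linear(HIDDEN, act_dim)

    def forward(self, obs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.fc1(obs)) * mask  # only masked units carry signal
        return self.fc2(h)


def influence_estimate(net, obs, experience_id, score_fn) -> float:
    """Score with the experience's mask (subnetwork trained WITH it) minus
    score with the flipped mask (subnetwork trained WITHOUT it)."""
    m = make_mask(experience_id)
    with torch.no_grad():
        with_exp = score_fn(net(obs, m))
        without_exp = score_fn(net(obs, 1.0 - m))
    return (with_exp - without_exp).item()
```

Because a single training run bakes every experience's mask into the same network, influence estimates for the whole buffer reduce to forward passes with flipped masks; this is where the claimed speedup over retraining-based LOO comes from.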

📝 Abstract
In reinforcement learning (RL) with experience replay, experiences stored in a replay buffer influence the RL agent's performance. Information about how these experiences influence the agent's performance is valuable for various purposes, such as identifying experiences that negatively influence underperforming agents. One method for estimating the influence of experiences is the leave-one-out (LOO) method. However, this method is usually computationally prohibitive. In this paper, we present Policy Iteration with Turn-over Dropout (PIToD), which efficiently estimates the influence of experiences. We evaluate how accurately PIToD estimates the influence of experiences and its efficiency compared to LOO. We then apply PIToD to amend underperforming RL agents, i.e., we use PIToD to estimate negatively influential experiences for the RL agents and to delete the influence of these experiences. We show that RL agents' performance is significantly improved via amendments with PIToD.
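For contrast, the leave-one-out influence that the abstract calls computationally prohibitive can be written as follows (the notation is ours, not necessarily the paper's):

```latex
% Leave-one-out (LOO) influence of experience e_i: the performance change
% when the agent is retrained with e_i removed from the replay buffer D.
\[
  I_{\mathrm{LOO}}(e_i) = J\bigl(\pi_{D}\bigr) - J\bigl(\pi_{D \setminus \{e_i\}}\bigr)
\]
% \pi_D: policy trained on D;  J(\pi): the policy's expected return.
```

Evaluating this exactly takes one full retraining per stored experience, i.e. |D| training runs over the buffer, which is the cost PIToD is designed to avoid.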
Problem

Research questions and friction points this paper is trying to address.

Efficiently estimating the influence of individual experiences in RL
Identifying experiences that negatively influence an RL agent
Improving RL agent performance by amending those experiences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Policy Iteration with Turn-over Dropout (PIToD)
Efficiently estimates the influence of experiences without retraining
Amends underperforming agents by deleting the influence of negatively influential experiences (sketched below)
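A hedged sketch of that amendment step, reusing `make_mask`, `influence_estimate`, and `HIDDEN` from the earlier snippet; the threshold and, in particular, composing multiple deletions by multiplying flipped masks are simplifying assumptions of ours, not the paper's procedure:

```python
import torch


def amended_eval_mask(net, obs, experience_ids, score_fn, threshold=0.0):
    """Combine the flipped masks of negatively influential experiences so the
    amended policy avoids the subnetworks those experiences trained.
    Composing deletions as an elementwise product is a simplification."""
    combined = torch.ones(HIDDEN)
    for eid in experience_ids:
        if influence_estimate(net, obs, eid, score_fn) < threshold:
            combined = combined * (1.0 - make_mask(eid))
    return combined  # use as net(obs, combined) at evaluation time
```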
Authors

Takuya Hiraoka (NEC Corporation)
Guanquan Wang (The University of Tokyo)
Takashi Onishi (NEC Corporation)
Yoshimasa Tsuruoka (The University of Tokyo)
Natural Language Processing, Reinforcement Learning, Artificial Intelligence for Games