AI Summary
To address low learning efficiency caused by experience homogenization in reinforcement learning, this paper proposes a human-inspired performance rating mechanism, the first to jointly model rating information for both reward learning and policy optimization. Methodologically, we design a rating-aware policy loss function that increases sensitivity to low-performing experiences while improving robustness to high-performing ones; in addition, we introduce a weighted distribution-matching penalty and a rating-based sample reweighting scheme to establish a unified reward-policy optimization framework. Experiments across multiple benchmark environments demonstrate that our method accelerates convergence by 23%–37% and achieves significantly higher final performance than baseline approaches that use ratings solely for reward learning. These results validate the efficacy of differentially leveraging graded experiences in RL.
Abstract
Reinforcement learning (RL), a common tool in decision making, learns policies from collected experiences based on their cumulative returns/rewards, without treating those experiences differently. In contrast, humans often learn to distinguish among different levels of performance and extract the underlying trends to improve their decision making. Motivated by this, this paper proposes a novel RL method that mimics the human decision-making process by differentiating among collected experiences for effective policy learning. The main idea is to extract important directional information from experiences at different performance levels, termed ratings, so that the policy can be updated to deviate appropriately from experiences of each rating class. Specifically, we propose a new policy loss function that penalizes distribution similarity between the current policy and failed experiences of different ratings, and assigns different weights to the penalty terms based on the rating classes. Meanwhile, reward learning from these rated samples can be combined with the new policy loss to form an integrated reward and policy learning framework over rated samples. Optimizing the integrated reward and policy loss discovers policy-improvement directions that maximize cumulative reward while penalizing similarity to the lowest performance level most heavily and to the highest performance level least. To evaluate the effectiveness of the proposed method, we report experiments on several typical environments showing improved convergence and overall performance over an existing rating-based reinforcement learning method that uses ratings only for reward learning.
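The weighted distribution-matching penalty described above can be illustrated with a minimal sketch. The function below is an assumption of one plausible form, not the paper's exact loss: for each rating class it measures how similar the current policy's action distribution is to the empirical action distribution of failed experiences in that class (via a KL-based similarity), and scales each term by a class weight so that the lowest-rated class is penalized most. The names `rating_weighted_penalty`, `rated_action_dists`, and `class_weights` are hypothetical.

```python
import numpy as np

def rating_weighted_penalty(policy_probs, rated_action_dists, class_weights):
    """Illustrative rating-weighted distribution-matching penalty.

    policy_probs: (A,) action distribution of the current policy at a state.
    rated_action_dists: dict mapping rating class -> (A,) empirical action
        distribution of failed experiences with that rating.
    class_weights: dict mapping rating class -> penalty weight; lower-rated
        classes get larger weights so the policy is pushed furthest away
        from the worst behavior.

    Returns sum_r w_r * exp(-KL(pi || d_r)): exp(-KL) is a similarity score
    (1 when the distributions match, decaying toward 0 as they diverge),
    so minimizing this penalty drives the policy away from rated failures.
    """
    eps = 1e-12  # numerical floor to avoid log(0)
    total = 0.0
    for rating, d in rated_action_dists.items():
        kl = np.sum(policy_probs * np.log((policy_probs + eps) / (d + eps)))
        total += class_weights[rating] * np.exp(-kl)
    return total
```

In a full training loop this penalty would be added to a standard policy-gradient loss; the weights realize the "penalize most from the lowest performance level, least from the highest" behavior described in the abstract.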