🤖 AI Summary
Designing reward functions for reinforcement learning in unstructured environments remains challenging due to sparse and ambiguous task specifications.
Method: This paper proposes a reward modeling approach based on multi-level, episode-wise human scoring feedback, departing from conventional binary preference comparisons. It introduces a global, non-Markovian episodic scoring mechanism and a unified Bayesian inference and inverse reinforcement learning framework for end-to-end co-learning of the reward function and the policy.
Contribution/Results: To the authors' knowledge, this is the first work to systematically integrate multi-level episodic feedback into reward modeling. The paper provides theoretical guarantees, proving a sublinear regret bound for the proposed algorithm. Empirical evaluation across diverse simulated robotic and navigation tasks demonstrates significant improvements in sample efficiency and policy performance, validating the efficacy of high-information-density scoring feedback over coarse-grained alternatives.
📝 Abstract
Designing an effective reward function has long been a challenge in reinforcement learning, particularly for complex tasks in unstructured environments. To address this, various learning paradigms have emerged that leverage different forms of human input to specify or refine the reward function. Reinforcement learning from human feedback is a prominent approach that uses comparative human feedback, expressed as a preference for one behavior over another, to tackle this problem. In contrast to comparative feedback, we explore multi-level human feedback, provided as a score at the end of each episode. Although coarser in granularity, since a single score evaluates an entire episode, this type of feedback carries more information about the underlying reward function than binary feedback. Additionally, it can handle non-Markovian rewards, as it is based on the evaluation of an entire episode. We propose an algorithm that efficiently learns both the reward function and the optimal policy from this form of feedback. Moreover, we show that the proposed algorithm achieves sublinear regret and demonstrate its empirical effectiveness through extensive simulations.
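To make the episode-level scoring idea concrete, here is a minimal, self-contained sketch: it assumes a linear episodic reward over trajectory features and models the multi-level score with ordinal logistic regression, then recovers the reward direction by a crude grid search. The feature map, cut-points, and grid-search fit are illustrative assumptions standing in for the paper's Bayesian inference and inverse-RL machinery, not the authors' actual algorithm.

```python
import numpy as np

# Hypothetical sketch: learn an episodic reward r(tau) = w . phi(tau)
# from multi-level end-of-episode scores via ordinal logistic regression.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Simulated data: each episode tau is summarized by a feature vector phi(tau).
true_w = np.array([2.0, -1.0])   # ground-truth reward weights (assumed)
cuts = np.array([-0.5, 0.5])     # cut-points separating adjacent score levels
phis = rng.normal(size=(500, 2)) # phi(tau) for 500 episodes
utilities = phis @ true_w
# Multi-level score in {0, 1, 2}: number of cut-points the utility exceeds.
scores = (utilities[:, None] > cuts[None, :]).sum(axis=1)

def neg_log_likelihood(w):
    """Ordinal-logistic NLL of the observed episode scores under weights w."""
    u = phis @ w
    n = len(u)
    # P(score <= k) = sigmoid(cut_k - u), padded with boundary CDF values 0, 1.
    cdf = np.column_stack(
        [np.zeros(n), sigmoid(cuts[None, :] - u[:, None]), np.ones(n)]
    )
    probs = cdf[np.arange(n), scores + 1] - cdf[np.arange(n), scores]
    return -np.log(np.clip(probs, 1e-12, None)).sum()

# Crude deterministic grid search over weight direction and magnitude,
# standing in for a principled Bayesian / maximum-likelihood fit.
best_w, best_nll = np.zeros(2), np.inf
for angle in np.linspace(0.0, 2.0 * np.pi, 72, endpoint=False):
    for mag in (0.5, 1.0, 2.0, 3.0, 4.0):
        w = mag * np.array([np.cos(angle), np.sin(angle)])
        nll = neg_log_likelihood(w)
        if nll < best_nll:
            best_w, best_nll = w, nll

# The recovered weights should align closely with the true reward direction.
alignment = best_w @ true_w / (np.linalg.norm(best_w) * np.linalg.norm(true_w))
```

Because each score evaluates a whole episode, the likelihood depends only on trajectory-level features, which is also why this form of supervision can accommodate non-Markovian rewards.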