🤖 AI Summary
In reinforcement learning settings where human feedback is costly or reward computation is expensive, frequent querying of true rewards severely limits scalability.
Method: This paper proposes an adaptive, confidence-based sparse reward-querying mechanism that dynamically estimates the uncertainty of state-action value estimates; it queries the true reward only when confidence falls below a threshold, otherwise relying on a learned reward model as a surrogate.
Contribution/Results: The approach explicitly decouples policy performance from dependence on true rewards, substantially reducing real-reward queries while preserving policy quality. Combining uncertainty quantification, conditional reward modeling, and online policy updates, it forms a lightweight hybrid RL paradigm. Empirical evaluation across multiple tasks shows that the method attains cumulative return and convergence speed comparable to full-reward baselines while using as few as 20% of the true reward calls.
📝 Abstract
In human-in-the-loop reinforcement learning, or in environments where computing a reward is expensive, costly rewards make efficient learning difficult to achieve. Because obtaining feedback from humans or computing expensive rewards at every step of a long training run may be infeasible, agents can be limited in how efficiently they improve their performance. Our aim is to reduce learning agents' reliance on humans or expensive rewards, improving learning efficiency while maintaining the quality of the learned policy. We present a novel reinforcement learning algorithm that requests a true reward only when its knowledge of the value of actions in an environment state is low. When confidence is high, our approach uses a learned reward-function model as a proxy for human-delivered or expensive rewards, and it asks for explicit rewards only when confidence in the model's predicted rewards and/or the resulting action selection is low. By reducing dependence on expensive-to-obtain rewards, we are able to learn efficiently in settings where the logistics or expense of obtaining rewards would otherwise prohibit it. In our experiments, our approach obtains performance comparable to a baseline in terms of return and the number of episodes required to learn, but achieves that performance with as few as 20% of the rewards.
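The gating idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the use of a small ensemble of linear reward predictors, and the ensemble-disagreement uncertainty measure are all assumptions standing in for whatever uncertainty estimator and reward model the paper actually uses. Only the control flow matters: query the expensive reward when disagreement exceeds a threshold (and train the surrogate on it), otherwise return the surrogate's prediction.

```python
import random
import statistics

class ConfidenceGatedReward:
    """Hypothetical sketch of confidence-gated reward querying: an ensemble
    of simple linear reward predictors stands in for a learned reward model,
    and ensemble disagreement stands in for the paper's uncertainty estimate."""

    def __init__(self, n_models=5, n_features=2, threshold=0.5, lr=0.1, seed=0):
        rng = random.Random(seed)
        # Each "model" is a linear predictor w·x with random initial weights.
        self.weights = [[rng.uniform(-1, 1) for _ in range(n_features)]
                        for _ in range(n_models)]
        self.threshold = threshold  # disagreement level that triggers a true query
        self.lr = lr
        self.true_queries = 0       # how often the expensive reward was consulted

    def _predict_all(self, x):
        return [sum(wi * xi for wi, xi in zip(w, x)) for w in self.weights]

    def reward(self, x, true_reward_fn):
        preds = self._predict_all(x)
        if statistics.pstdev(preds) > self.threshold:
            # Low confidence: pay for the true reward and train each member on it.
            r = true_reward_fn(x)
            self.true_queries += 1
            for w, p in zip(self.weights, preds):
                err = p - r
                for i in range(len(w)):
                    w[i] -= self.lr * err * x[i]  # one SGD step on squared error
            return r
        # High confidence: use the learned surrogate instead of the true reward.
        return statistics.mean(preds)
```

In use, early steps mostly fall through to the true reward; as the ensemble members agree, the surrogate takes over and the fraction of true queries drops, which is the effect the abstract reports (comparable learning with a small fraction of the rewards).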