🤖 AI Summary
Offline reinforcement learning (RL) suffers from distributional shift and scarce reward annotations, leading to poor policy generalization. To address these challenges, we propose a semi-pessimistic RL framework that, for the first time, employs a lower bound on the reward function—rather than on the Q-function or transition model—for pessimistic estimation. We theoretically prove that this approach guarantees monotonic policy improvement even under purely unlabeled-data-driven training, with milder constraint requirements. Our method unifies model-free and model-based paradigms, leveraging large-scale unlabeled data via semi-supervised learning to jointly optimize reward modeling and conservative policy updates. Empirical evaluation across multiple standard offline RL benchmarks and a real-world adaptive deep brain stimulation application for Parkinson’s disease demonstrates that our approach significantly outperforms existing offline RL algorithms, achieving superior robustness, training stability, and clinical applicability.
📝 Abstract
Offline reinforcement learning (RL) aims to learn an optimal policy from pre-collected data. However, it faces the challenge of distributional shift, where the learned policy may encounter unseen scenarios not covered in the offline data. Additionally, numerous applications suffer from a scarcity of labeled reward data. Relying on labeled data alone often yields a narrow state-action distribution, which further amplifies the distributional shift and results in suboptimal policy learning. To address these issues, we first recognize that the volume of unlabeled data is typically substantially larger than that of labeled data. We then propose a semi-pessimistic RL method to effectively leverage abundant unlabeled data. Our approach offers several advantages. It considerably simplifies the learning process, as it seeks a lower bound of the reward function, rather than that of the Q-function or state transition function. It is highly flexible, and can be integrated with a range of model-free and model-based RL algorithms. It enjoys guaranteed policy improvement when utilizing vast amounts of unlabeled data, while requiring much less restrictive conditions. We compare our method with a number of alternative solutions, both analytically and numerically, and demonstrate its clear competitiveness. We further illustrate the method with an application to adaptive deep brain stimulation for Parkinson's disease.
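The core idea of pessimistic estimation on the reward function (rather than the Q-function or transition model) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: it assumes an ensemble of bootstrapped reward models fit on the small labeled set, with the lower bound taken as mean minus a multiple of the ensemble's standard deviation; the pessimism coefficient `k`, the linear reward models, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: a small labeled set and a large unlabeled set,
# reflecting the abstract's premise that unlabeled data is far more abundant.
labeled_X = rng.normal(size=(50, 4))      # labeled state-action features
true_w = np.array([1.0, -0.5, 0.3, 0.0])
labeled_r = labeled_X @ true_w + 0.1 * rng.normal(size=50)
unlabeled_X = rng.normal(size=(1000, 4))  # abundant unlabeled transitions

def fit_bootstrap_reward_model(X, y):
    # Least-squares reward model fit on a bootstrap resample of the labeled data.
    idx = rng.integers(0, len(X), size=len(X))
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return w

# Ensemble of reward models; disagreement serves as an uncertainty proxy.
ensemble = [fit_bootstrap_reward_model(labeled_X, labeled_r) for _ in range(10)]
preds = np.stack([unlabeled_X @ w for w in ensemble])  # shape (10, 1000)

k = 1.0  # pessimism coefficient (assumed hyperparameter)
reward_lower_bound = preds.mean(axis=0) - k * preds.std(axis=0)
```

The pessimistic pseudo-rewards `reward_lower_bound` could then label the unlabeled transitions for any downstream model-free or model-based offline RL update, which is where the framework's flexibility comes from: only the reward estimate is made conservative, not the full value or dynamics model.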