Pessimistic Auxiliary Policy for Offline Reinforcement Learning

📅 2026-02-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Offline reinforcement learning is prone to value overestimation and error accumulation due to out-of-distribution actions. To address this issue, this work proposes a pessimistic auxiliary policy based on the lower confidence bound of the Q-function, which samples actions within the neighborhood of the current policy that exhibit both high estimated returns and low uncertainty. This approach effectively mitigates value overestimation and prevents error propagation during training. Empirical evaluations demonstrate that the proposed method significantly outperforms existing algorithms across multiple standard offline reinforcement learning benchmarks, highlighting its effectiveness and broad applicability.


πŸ“ Abstract
Offline reinforcement learning aims to learn an agent from pre-collected datasets, avoiding unsafe and inefficient real-time interaction. However, inevitable access to out-of-distribution actions during the learning process introduces approximation errors, causing error accumulation and considerable overestimation. In this paper, we construct a new pessimistic auxiliary policy for sampling reliable actions. Specifically, we obtain the pessimistic auxiliary policy by maximizing the lower confidence bound of the Q-function. This auxiliary policy attains relatively high value and low uncertainty in the vicinity of the learned policy, preventing the learned policy from sampling high-value actions with potentially large errors during training. Because actions sampled from the pessimistic auxiliary policy introduce less approximation error, error accumulation is alleviated. Extensive experiments on offline reinforcement learning benchmarks reveal that the pessimistic auxiliary policy effectively improves the efficacy of other offline RL approaches.
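The selection rule described above — score actions near the learned policy by the lower confidence bound (mean minus a multiple of the standard deviation) of a Q-ensemble and pick the best — can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the candidate-sampling scheme, the ensemble interface, and the `beta` and `noise_scale` hyperparameters are assumptions.

```python
import numpy as np

def lcb(q_ensemble, state, action, beta=1.0):
    """Lower confidence bound of Q over an ensemble: mean - beta * std.
    Larger beta means a more pessimistic (uncertainty-penalized) estimate."""
    vals = np.array([q(state, action) for q in q_ensemble])
    return vals.mean() - beta * vals.std()

def pessimistic_auxiliary_action(q_ensemble, state, policy_action,
                                 noise_scale=0.1, n_candidates=32,
                                 beta=1.0, rng=None):
    """Sample candidate actions in the neighborhood of the learned policy's
    action and return the one maximizing the LCB of the Q-ensemble, i.e. an
    action with high estimated value AND low ensemble disagreement."""
    rng = np.random.default_rng() if rng is None else rng
    # Perturb the policy's action to stay close to the learned policy.
    candidates = policy_action + noise_scale * rng.standard_normal(
        (n_candidates, np.size(policy_action)))
    scores = [lcb(q_ensemble, state, a, beta) for a in candidates]
    return candidates[int(np.argmax(scores))]
```

In a full algorithm this auxiliary action would replace the raw policy sample when forming Bootstrap/Bellman targets, so that targets are computed on actions the ensemble agrees about, which is the mechanism the abstract credits for reducing error accumulation.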
Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning
out-of-distribution actions
approximation error
error accumulation
overestimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

pessimistic auxiliary policy
offline reinforcement learning
lower confidence bound
error accumulation
out-of-distribution actions