Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data

📅 2025-07-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses performance degradation in offline reinforcement learning caused by Q-value extrapolation. We first identify the root cause: linear Q-functions induce catastrophic errors in out-of-distribution regions due to unbounded growth. To resolve this without overly conservative constraints, we propose PARS, a novel algorithm integrating Reward Scaling and Layer Normalization (RS-LN) with Penalized Actions (PA). RS-LN suppresses spurious Q-value inflation beyond the behavior dataset's support, while PA explicitly penalizes infeasible actions to improve policy safety and robustness. Evaluated on the D4RL benchmark, PARS consistently outperforms state-of-the-art methods. Notably, on the challenging AntMaze Ultra task, it achieves substantial gains in offline policy performance. Moreover, when fine-tuned online, PARS converges faster and attains higher final returns, demonstrating superior generalization and training stability.

๐Ÿ“ Abstract
Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
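The RS-LN idea described above, layer normalization inside the critic combined with multiplying dataset rewards by a large constant, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the network shape, the LayerNorm placement, and the `scale` value are all assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each sample's features to zero mean and unit variance.

    Bounding hidden-activation magnitude is what keeps the critic from
    extrapolating Q-values linearly far outside the data range.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def q_forward(params, state, action):
    """Q(s, a) MLP with LayerNorm after every hidden linear layer."""
    x = np.concatenate([state, action], axis=-1)
    for W, b in params[:-1]:
        x = np.maximum(layer_norm(x @ W + b), 0.0)  # LayerNorm, then ReLU
    W, b = params[-1]
    return x @ W + b                                # final layer stays linear

def scale_rewards(rewards, scale=1000.0):
    """Reward scaling: multiply dataset rewards by a constant before TD
    learning. scale=1000.0 is a placeholder; the paper tunes it per task."""
    return rewards * scale
```

Intuition for the combination: LayerNorm caps how large hidden features can get, so scaled-up rewards raise in-distribution Q-values without letting out-of-distribution Q-values grow without bound.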
Problem

Research questions and friction points this paper is trying to address.

Mitigates Q-value extrapolation errors in offline RL
Penalizes infeasible actions to guide Q-values
Improves performance in offline and online RL tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Guiding Q-value decrease outside data range
Reward scaling with layer normalization (RS-LN)
Penalization mechanism for infeasible actions (PA)
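The PA mechanism listed above can be illustrated with a small sketch: sample actions just outside the feasible action box and regress their Q-values toward a fixed low target. The shell sampler, the `margin` value, and the MSE form below are assumptions for illustration; the paper's exact penalization scheme may differ.

```python
import numpy as np

def sample_infeasible_actions(batch_size, action_dim, bound=1.0, margin=0.5, rng=None):
    """Draw actions in the shell just outside the feasible box [-bound, bound]^d.

    Hypothetical sampler: magnitudes in [bound, bound + margin) with random
    signs, so every component is guaranteed infeasible.
    """
    rng = np.random.default_rng() if rng is None else rng
    magnitude = rng.uniform(bound, bound + margin, size=(batch_size, action_dim))
    sign = rng.choice([-1.0, 1.0], size=(batch_size, action_dim))
    return sign * magnitude

def pa_loss(q_infeasible, q_min):
    """Penalized Actions objective: pull Q-values of infeasible actions
    toward a fixed low target q_min (here a plain MSE)."""
    return np.mean((q_infeasible - q_min) ** 2)
```

Added to the usual TD loss, this term anchors Q-values outside the data's action support, complementing the gradual decrease that RS-LN encourages.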