🤖 AI Summary
This work addresses performance degradation in offline reinforcement learning caused by Q-value extrapolation. We first identify the root cause: linear Q-functions induce catastrophic errors in out-of-distribution regions due to unbounded, non-monotonic growth. To resolve this without conservative constraints, we propose PARS, a novel algorithm integrating Reward Scaling and Layer Normalization (RS-LN) with Penalized Actions (PA). RS-LN suppresses spurious Q-value inflation beyond the behavior dataset's support, while PA explicitly penalizes unseen actions to improve policy safety and robustness. Evaluated on the D4RL benchmark, PARS consistently outperforms state-of-the-art methods. Notably, on the challenging AntMaze Ultra task, it achieves substantial gains in offline policy performance. Moreover, when fine-tuned online, PARS converges faster and attains higher final returns, demonstrating superior generalization and training stability.
📝 Abstract
Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding Q-values to decrease gradually outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
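To build intuition for why layer normalization bounds Q-value extrapolation, the following is a minimal numpy sketch (not the paper's implementation; the network sizes, weights, and the `reward_scale` value are illustrative assumptions). Because LayerNorm forces each hidden layer's features to zero mean and unit variance, the final hidden vector has bounded norm, so the Q-value output stays bounded even for actions far outside the dataset's support, in contrast to a plain linear head whose output grows without bound.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize features to zero mean / unit variance. This caps the
    # activation scale, which prevents unbounded linear growth of
    # Q-values in out-of-distribution regions.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def q_network(state, action, W1, W2, w_out):
    # Two-layer ReLU critic with LayerNorm after each hidden layer
    # (a common RS-LN-style architecture sketch, not the exact PARS net).
    h = np.concatenate([state, action], axis=-1)
    h = layer_norm(np.maximum(W1 @ h, 0.0))
    h = layer_norm(np.maximum(W2 @ h, 0.0))
    return float(w_out @ h)  # scalar Q-value

# Reward scaling (the "RS" in RS-LN): rewards are multiplied by a
# constant before computing TD targets; 10.0 is a hypothetical value.
reward_scale = 10.0

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 6))   # state dim 4 + action dim 2 -> hidden 16
W2 = rng.normal(size=(16, 16))
w_out = rng.normal(size=16)

state = rng.normal(size=4)
a_ood = 1000.0 * rng.normal(size=2)  # action far outside the data range

# Since layer_norm output has unit variance over 16 features,
# |Q| <= ||w_out|| * sqrt(16) regardless of how extreme the action is.
q_ood = q_network(state, a_ood, W1, W2, w_out)
bound = np.linalg.norm(w_out) * np.sqrt(16)
```

The same forward pass without the two `layer_norm` calls would scale roughly linearly with the action magnitude, which is exactly the extrapolation failure mode the abstract describes.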