🤖 AI Summary
This work addresses performance degradation in offline reinforcement learning caused by Q-value extrapolation. We first identify the root cause: linear Q-functions induce catastrophic errors in out-of-distribution regions due to unbounded, non-monotonic growth. To resolve this without conservative constraints, we propose PARS, a novel algorithm integrating Reward Scaling and Layer Normalization (RS-LN) with Penalized Actions (PA). RS-LN suppresses spurious Q-value inflation beyond the behavior dataset's support, while PA explicitly penalizes unseen actions to improve policy safety and robustness. Evaluated on the D4RL benchmark, PARS consistently outperforms state-of-the-art methods. Notably, on the challenging AntMaze Ultra task, it achieves substantial gains in offline policy performance. Moreover, when fine-tuned online, PARS converges faster and attains higher final returns, demonstrating superior generalization and training stability.
📝 Abstract
Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding Q-values to decrease gradually outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
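To build intuition for why layer normalization bounds Q-value extrapolation, the following is a minimal numpy sketch (not the paper's implementation; the network sizes, weights, and the `reward_scale` value are illustrative assumptions). Because LayerNorm forces each hidden layer's features to zero mean and unit variance, the final hidden vector has bounded norm, so the Q-value output stays bounded even for actions far outside the dataset's support, in contrast to a plain linear head whose output grows without bound.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize features to zero mean / unit variance. This caps the
    # activation scale, which prevents unbounded linear growth of
    # Q-values in out-of-distribution regions.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def q_network(state, action, W1, W2, w_out):
    # Two-layer ReLU critic with LayerNorm after each hidden layer
    # (a common RS-LN-style architecture sketch, not the exact PARS net).
    h = np.concatenate([state, action], axis=-1)
    h = layer_norm(np.maximum(W1 @ h, 0.0))
    h = layer_norm(np.maximum(W2 @ h, 0.0))
    return float(w_out @ h)  # scalar Q-value

# Reward scaling (the "RS" in RS-LN): rewards are multiplied by a
# constant before computing TD targets; 10.0 is a hypothetical value.
reward_scale = 10.0

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 6))   # state dim 4 + action dim 2 -> hidden 16
W2 = rng.normal(size=(16, 16))
w_out = rng.normal(size=16)

state = rng.normal(size=4)
a_ood = 1000.0 * rng.normal(size=2)  # action far outside the data range

# Since layer_norm output has unit variance over 16 features,
# |Q| <= ||w_out|| * sqrt(16) regardless of how extreme the action is.
q_ood = q_network(state, a_ood, W1, W2, w_out)
bound = np.linalg.norm(w_out) * np.sqrt(16)
```

The same forward pass without the two `layer_norm` calls would scale roughly linearly with the action magnitude, which is exactly the extrapolation failure mode the abstract describes.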