Experimental Study on the Effect of Multi-step Deep Reinforcement Learning in POMDPs

📅 2022-09-12
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work investigates the applicability and robustness of multi-step deep reinforcement learning (RL) in partially observable Markov decision processes (POMDPs). We identify a critical limitation: standard MDP-based algorithms, including PPO, TD3, and SAC, exhibit severe performance degradation under partial observability, as state uncertainty induces policy collapse. We first demonstrate that multi-step bootstrapping mitigates this collapse by improving temporal credit assignment despite missing state information. Building on this insight, we propose two variants, MTD3 and MSAC, which integrate multi-step return estimation into the TD3 and SAC frameworks, respectively. Comprehensive evaluation across three canonical POMDP benchmarks shows that MTD3 and MSAC achieve an average 23.7% performance gain over their baselines, improve convergence stability by 41%, and significantly enhance policy robustness. Our work establishes a scalable, robust multi-step optimization paradigm for deep RL in POMDP settings.
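For reference, the multi-step (n-step) bootstrapped target that such variants rely on takes the following standard form (a textbook formulation, not necessarily the exact variant used in the paper):

$$
y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n Q_{\bar{\theta}}(o_{t+n}, a_{t+n}), \qquad a_{t+n} \sim \pi(\cdot \mid o_{t+n})
$$

Here $\gamma$ is the discount factor and $Q_{\bar{\theta}}$ is the target critic evaluated at the observation-action pair n steps ahead; the usual one-step TD target is the special case $n = 1$. In a SAC-style update the bootstrap term would additionally include the entropy bonus $-\alpha \log \pi(a_{t+n} \mid o_{t+n})$.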
📝 Abstract
Deep Reinforcement Learning (DRL) has made tremendous advances in both simulated and real-world robot control tasks in recent years. This is particularly the case for tasks that can be carefully engineered with a full state representation, and which can then be formulated as a Markov Decision Process (MDP). However, applying DRL strategies designed for MDPs to novel robot control tasks can be challenging, because the available observations may be only a partial representation of the state, resulting in a Partially Observable Markov Decision Process (POMDP). This paper considers three popular DRL algorithms originally designed for MDPs, namely Proximal Policy Optimization (PPO), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor-Critic (SAC), and studies their performance in POMDP scenarios. While prior work has found that SAC and TD3 typically outperform PPO across a broad range of tasks that can be represented as MDPs, we show, using three representative POMDP environments, that this is not always the case. Empirical studies show that this is related to multi-step bootstrapping, where multi-step immediate rewards, instead of the one-step immediate reward, are used to compute the target value estimate of an observation-action pair. We verify this by showing that adding multi-step bootstrapping to TD3 (MTD3) and SAC (MSAC) results in improved robustness in POMDP settings.
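To make the bootstrapping idea above concrete, here is a minimal sketch of how an n-step target can be computed from a sampled trajectory segment. The helper name and interface are hypothetical, not the authors' implementation; for MSAC the `bootstrap_value` would also carry SAC's entropy term.

```python
def multi_step_target(rewards, bootstrap_value, dones, gamma=0.99):
    """n-step bootstrapped TD target:
        y = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
            + gamma^n * Q_target(o_{t+n}, a_{t+n})

    rewards:         the n immediate rewards r_t, ..., r_{t+n-1}
    bootstrap_value: target critic's estimate at the observation n steps ahead
    dones:           termination flags aligned with rewards; a 1.0 cuts bootstrapping
    """
    target = bootstrap_value
    # Fold the rewards in backwards; a terminal step zeroes out the tail.
    for r, d in zip(reversed(rewards), reversed(dones)):
        target = r + gamma * (1.0 - d) * target
    return target

# 3-step example: y = 1.0 + 0.99*(0.5 + 0.99*(0.2 + 0.99*10.0)) ≈ 11.39
y = multi_step_target([1.0, 0.5, 0.2], bootstrap_value=10.0, dones=[0.0, 0.0, 0.0])
```

With n = 1 this reduces to the standard one-step target used by vanilla TD3 and SAC.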
Problem

Research questions and friction points this paper is trying to address.

Multi-step Deep Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
Algorithm Comparison (PPO, TD3, SAC)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-step Deep Reinforcement Learning
POMDP Tasks
Stability and Reliability Improvement
Lingheng Meng
CSIRO Data61, Private Bag 10, Clayton South, VIC 3169, AU; Faculty of Engineering, Monash University, Clayton, VIC 3800, AU
R. Gorbet
Departments of Knowledge Integration and Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, CA
Michael Burke
Monash University
Robot learning · Imitation learning · Intelligent Robotics · Machine Learning
Dana Kulić
Faculty of Engineering, Monash University, Clayton, VIC 3800, AU