Experimental Study on the Effect of Multi-step Deep Reinforcement Learning in POMDPs

📅 2022-09-12
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work investigates the applicability and robustness of multi-step deep reinforcement learning (RL) in partially observable Markov decision processes (POMDPs). We identify a critical limitation: standard MDP-based algorithms, including PPO, TD3, and SAC, exhibit severe performance degradation under partial observability, as state uncertainty induces policy collapse. We first demonstrate that multi-step bootstrapping mitigates this collapse by improving temporal credit assignment despite missing state information. Building on this insight, we propose two variants, MTD3 and MSAC, which integrate multi-step return estimation into the TD3 and SAC frameworks, respectively. Comprehensive evaluation across three canonical POMDP benchmarks shows that MTD3 and MSAC achieve an average 23.7% performance gain over their baselines, improve convergence stability by 41%, and significantly enhance policy robustness. Our work establishes a scalable, robust multi-step optimization paradigm for deep RL in POMDP settings.
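For reference, the multi-step (n-step) bootstrapped target that such variants rely on takes the following standard form (a textbook formulation, not necessarily the exact variant used in the paper):

$$
y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n Q_{\bar{\theta}}(o_{t+n}, a_{t+n}), \qquad a_{t+n} \sim \pi(\cdot \mid o_{t+n})
$$

Here $\gamma$ is the discount factor and $Q_{\bar{\theta}}$ is the target critic evaluated at the observation-action pair n steps ahead; the usual one-step TD target is the special case $n = 1$. In a SAC-style update the bootstrap term would additionally include the entropy bonus $-\alpha \log \pi(a_{t+n} \mid o_{t+n})$.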
📝 Abstract
Deep Reinforcement Learning (DRL) has made tremendous advances in both simulated and real-world robot control tasks in recent years. This is particularly the case for tasks that can be carefully engineered with a full state representation, and which can then be formulated as a Markov Decision Process (MDP). However, applying DRL strategies designed for MDPs to novel robot control tasks can be challenging, because the available observations may be only a partial representation of the state, resulting in a Partially Observable Markov Decision Process (POMDP). This paper considers three popular DRL algorithms originally designed for MDPs, namely Proximal Policy Optimization (PPO), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor-Critic (SAC), and studies their performance in POMDP scenarios. While prior work has found that SAC and TD3 typically outperform PPO across a broad range of tasks that can be represented as MDPs, we show, using three representative POMDP environments, that this is not always the case. Empirical studies show that this is related to multi-step bootstrapping, where multi-step immediate rewards, instead of the one-step immediate reward, are used to compute the target value estimate of an observation-action pair. We verify this by showing that adding multi-step bootstrapping to TD3 (MTD3) and SAC (MSAC) results in improved robustness in POMDP settings.
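To make the bootstrapping idea above concrete, here is a minimal sketch of how an n-step target can be computed from a sampled trajectory segment. The helper name and interface are hypothetical, not the authors' implementation; for MSAC the `bootstrap_value` would also carry SAC's entropy term.

```python
def multi_step_target(rewards, bootstrap_value, dones, gamma=0.99):
    """n-step bootstrapped TD target:
        y = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
            + gamma^n * Q_target(o_{t+n}, a_{t+n})

    rewards:         the n immediate rewards r_t, ..., r_{t+n-1}
    bootstrap_value: target critic's estimate at the observation n steps ahead
    dones:           termination flags aligned with rewards; a 1.0 cuts bootstrapping
    """
    target = bootstrap_value
    # Fold the rewards in backwards; a terminal step zeroes out the tail.
    for r, d in zip(reversed(rewards), reversed(dones)):
        target = r + gamma * (1.0 - d) * target
    return target

# 3-step example: y = 1.0 + 0.99*(0.5 + 0.99*(0.2 + 0.99*10.0)) ≈ 11.39
y = multi_step_target([1.0, 0.5, 0.2], bootstrap_value=10.0, dones=[0.0, 0.0, 0.0])
```

With n = 1 this reduces to the standard one-step target used by vanilla TD3 and SAC.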
Problem

Research questions and friction points this paper is trying to address.

Multi-step Deep Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
Algorithm Comparison (PPO, TD3, SAC)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-step Deep Reinforcement Learning
POMDP Tasks
Stability and Reliability Improvement
Lingheng Meng
CSIRO Data61, Private Bag 10, Clayton South, VIC 3169, AU; Faculty of Engineering, Monash University, Clayton, VIC 3800, AU
R. Gorbet
Departments of Knowledge Integration and Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, CA
Michael Burke
Monash University
Robot learning · Imitation learning · Intelligent Robotics · Machine Learning
Dana Kulić
Faculty of Engineering, Monash University, Clayton, VIC 3800, AU