🤖 AI Summary
This paper addresses robust partially observable Markov decision processes (POMDPs), in which the transition and observation models are only known to lie within uncertainty sets. It proposes the pessimistic iterative planning (PIP) framework for computing memory-based policies that are robust against worst-case instances of the model. PIP alternates between two steps: selecting a pessimistic (non-robust) POMDP via worst-case probability instances from the uncertainty sets, and computing a finite-state controller (FSC) for that POMDP. The FSC is then evaluated on the original robust POMDP, and this evaluation guides the selection of the next pessimistic POMDP. Within PIP, the rFSCNet algorithm finds an FSC in each iteration through a recurrent neural network trained with supervision policies optimized for the pessimistic POMDP. Experiments in four benchmark environments show improved robustness over several baseline methods and performance competitive with a state-of-the-art robust POMDP solver.
📝 Abstract
Robust POMDPs extend classical POMDPs to handle model uncertainty. Specifically, robust POMDPs exhibit so-called uncertainty sets on the transition and observation models, effectively defining ranges of probabilities. Policies for robust POMDPs must be (1) memory-based to account for partial observability and (2) robust against model uncertainty to account for the worst-case instances from the uncertainty sets. To compute such robust memory-based policies, we propose the pessimistic iterative planning (PIP) framework, which alternates between two main steps: (1) selecting a pessimistic (non-robust) POMDP via worst-case probability instances from the uncertainty sets; and (2) computing a finite-state controller (FSC) for this pessimistic POMDP. We evaluate the performance of this FSC on the original robust POMDP and use this evaluation in step (1) to select the next pessimistic POMDP. Within PIP, we propose the rFSCNet algorithm. In each iteration, rFSCNet finds an FSC through a recurrent neural network by using supervision policies optimized for the pessimistic POMDP. The empirical evaluation in four benchmark environments shows improved robustness over several baseline methods and competitive performance compared to a state-of-the-art robust POMDP solver.
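To make the idea of a worst-case probability instance concrete, the following minimal sketch assumes the uncertainty set for a single distribution is a vector of probability intervals, and that a value estimate for each successor state is already available. The function name `worst_case_distribution` and its parameters are hypothetical and do not come from the paper. The abstract does not specify how PIP performs its selection, so this only illustrates the standard greedy minimization of an expected value over interval bounds.

```python
from typing import List, Sequence


def worst_case_distribution(
    lower: Sequence[float],
    upper: Sequence[float],
    values: Sequence[float],
) -> List[float]:
    """Return p with lower[i] <= p[i] <= upper[i] and sum(p) == 1
    that minimizes the expected value sum(p[i] * values[i]).

    Hypothetical helper for illustration; not the paper's procedure.
    """
    if sum(lower) > 1.0 + 1e-12 or sum(upper) < 1.0 - 1e-12:
        raise ValueError("intervals contain no valid probability distribution")

    # Start from the lower bounds, which every valid distribution must meet.
    p = list(lower)
    # Probability mass that still has to be placed somewhere.
    budget = max(0.0, 1.0 - sum(lower))

    # Pessimistic choice: push the remaining mass onto the lowest-value
    # successors first, up to each upper bound.
    for i in sorted(range(len(values)), key=lambda j: values[j]):
        add = min(upper[i] - lower[i], budget)
        p[i] += add
        budget -= add
    return p


if __name__ == "__main__":
    # Three successor states with interval bounds and assumed values.
    p = worst_case_distribution(
        lower=[0.1, 0.2, 0.3],
        upper=[0.5, 0.6, 0.7],
        values=[10.0, 0.0, 5.0],
    )
    print(p)
```

Fixing one such distribution for every state-action pair yields an ordinary (non-robust) POMDP, which is the kind of object step (1) hands to step (2). In the full framework the values would come from evaluating the current FSC on the robust POMDP rather than being given up front.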