🤖 AI Summary
In hidden-model partially observable Markov decision processes (HM-POMDPs), the true environment model is unknown, so decisions must be robust across an uncertainty set of models. Method: We propose the first framework integrating formal verification with optimization: it uses deductive verification to extract a worst-case POMDP from the uncertainty set, then applies a finite-memory policy parameterization with robust subgradient ascent to optimize performance on that worst case. Contribution/Results: The approach scales to HM-POMDPs with over 100,000 environments, significantly outperforming baselines, while providing provable robustness guarantees: the learned policies retain certified performance lower bounds even under the worst-case model and generalize well to unseen POMDPs. Crucially, worst-case verification is embedded directly in the gradient-based optimization loop, balancing formal correctness with computational tractability.
📝 Abstract
Partially observable Markov decision processes (POMDPs) model specific environments in sequential decision-making under uncertainty. Critically, optimal policies for POMDPs may not be robust against perturbations in the environment. Hidden-model POMDPs (HM-POMDPs) capture sets of different environment models, that is, POMDPs with a shared action and observation space. The intuition is that the true model is hidden among a set of potential models, and it is unknown which model will be the environment at execution time. A policy is robust for a given HM-POMDP if it achieves sufficient performance for each of its POMDPs. We compute such robust policies by combining two orthogonal techniques: (1) a deductive formal verification technique that supports tractable robust policy evaluation by computing a worst-case POMDP within the HM-POMDP and (2) subgradient ascent to optimize the candidate policy for a worst-case POMDP. The empirical evaluation shows that, compared to various baselines, our approach (1) produces policies that are more robust and generalize better to unseen POMDPs and (2) scales to HM-POMDPs that consist of over a hundred thousand environments.
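The alternation the abstract describes, evaluating the candidate policy to find a worst-case POMDP in the set, then taking an ascent step against that model, can be sketched on a toy problem. Everything below is an illustrative assumption rather than the authors' implementation: the 2-state models, the softmax observation-based policy (a one-node finite-state controller), and a finite-difference gradient standing in for the paper's subgradient. Policy evaluation is exact here, via the Markov chain the policy induces on each POMDP.

```python
import numpy as np

GAMMA = 0.9  # discount factor (assumed for this toy example)

def make_pomdp(noise):
    """Hypothetical 2-state, 2-action, 2-observation POMDP.
    Models in the set share dynamics but differ in observation noise,
    playing the role of the hidden perturbation in an HM-POMDP."""
    T = np.array([          # T[a, s, s']: action 0 favors state 0, action 1 flips
        [[0.9, 0.1], [0.2, 0.8]],
        [[0.1, 0.9], [0.8, 0.2]],
    ])
    O = np.array([[1 - noise, noise],   # O[s, o]: noisy state observation
                  [noise, 1 - noise]])
    R = np.array([1.0, 0.0])            # reward per state
    return T, O, R

def policy(theta):
    """Memoryless stochastic policy pi(a|o) via row-wise softmax."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # shape [obs, act]

def evaluate(theta, pomdp):
    """Exact expected discounted return from state 0 under the induced chain."""
    T, O, R = pomdp
    pi = policy(theta)
    # P[s, s'] = sum_o O[s, o] * sum_a pi[o, a] * T[a, s, s']
    P = np.einsum("so,oa,asr->sr", O, pi, T)
    v = np.linalg.solve(np.eye(len(R)) - GAMMA * P, R)
    return v[0]

def worst_case(theta, models):
    """Robust evaluation: the minimizing POMDP in the (finite) model set."""
    vals = [evaluate(theta, m) for m in models]
    i = int(np.argmin(vals))
    return i, vals[i]

def robust_ascent(models, steps=100, lr=0.5, eps=1e-4):
    """Alternate worst-case selection with an ascent step on that model.
    Central finite differences stand in for the paper's subgradient."""
    theta = np.zeros((2, 2))
    for _ in range(steps):
        i, _ = worst_case(theta, models)
        grad = np.zeros_like(theta)
        for idx in np.ndindex(*theta.shape):
            d = np.zeros_like(theta)
            d[idx] = eps
            grad[idx] = (evaluate(theta + d, models[i])
                         - evaluate(theta - d, models[i])) / (2 * eps)
        theta += lr * grad
    return theta
```

The design choice mirrored here is the key one from the summary: the worst-case model is re-identified inside every optimization step, so the ascent direction always targets the current robust bottleneck rather than a fixed nominal model.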