🤖 AI Summary
In hidden-model partially observable Markov decision processes (HM-POMDPs), the true environment model is unknown, so decisions must be robust across an uncertainty set of models. Method: We propose the first framework integrating formal verification with optimization: it uses deductive verification to extract a worst-case POMDP from the uncertainty set, then applies a finite-memory policy parameterization with robust subgradient ascent to optimize performance on that worst case. Contribution/Results: The approach scales to HM-POMDPs with over 100,000 environments, significantly outperforming baselines, while providing provable robustness guarantees: the learned policies retain certified performance lower bounds even under the worst-case model and generalize well to unseen POMDPs. Crucially, worst-case verification is embedded directly in the gradient-based optimization loop, balancing formal correctness with computational tractability.
📝 Abstract
Partially observable Markov decision processes (POMDPs) model specific environments in sequential decision-making under uncertainty. Critically, optimal policies for POMDPs may not be robust against perturbations in the environment. Hidden-model POMDPs (HM-POMDPs) capture sets of different environment models, that is, POMDPs with a shared action and observation space. The intuition is that the true model is hidden among a set of potential models, and it is unknown which model will be the environment at execution time. A policy is robust for a given HM-POMDP if it achieves sufficient performance for each of its POMDPs. We compute such robust policies by combining two orthogonal techniques: (1) a deductive formal verification technique that supports tractable robust policy evaluation by computing a worst-case POMDP within the HM-POMDP and (2) subgradient ascent to optimize the candidate policy for a worst-case POMDP. The empirical evaluation shows that, compared to various baselines, our approach (1) produces policies that are more robust and generalize better to unseen POMDPs and (2) scales to HM-POMDPs that consist of over a hundred thousand environments.
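The alternation the abstract describes, evaluating the candidate policy to find a worst-case POMDP in the set, then taking an ascent step against that model, can be sketched on a toy problem. Everything below is an illustrative assumption rather than the authors' implementation: the 2-state models, the softmax observation-based policy (a one-node finite-state controller), and a finite-difference gradient standing in for the paper's subgradient. Policy evaluation is exact here, via the Markov chain the policy induces on each POMDP.

```python
import numpy as np

GAMMA = 0.9  # discount factor (assumed for this toy example)

def make_pomdp(noise):
    """Hypothetical 2-state, 2-action, 2-observation POMDP.
    Models in the set share dynamics but differ in observation noise,
    playing the role of the hidden perturbation in an HM-POMDP."""
    T = np.array([          # T[a, s, s']: action 0 favors state 0, action 1 flips
        [[0.9, 0.1], [0.2, 0.8]],
        [[0.1, 0.9], [0.8, 0.2]],
    ])
    O = np.array([[1 - noise, noise],   # O[s, o]: noisy state observation
                  [noise, 1 - noise]])
    R = np.array([1.0, 0.0])            # reward per state
    return T, O, R

def policy(theta):
    """Memoryless stochastic policy pi(a|o) via row-wise softmax."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # shape [obs, act]

def evaluate(theta, pomdp):
    """Exact expected discounted return from state 0 under the induced chain."""
    T, O, R = pomdp
    pi = policy(theta)
    # P[s, s'] = sum_o O[s, o] * sum_a pi[o, a] * T[a, s, s']
    P = np.einsum("so,oa,asr->sr", O, pi, T)
    v = np.linalg.solve(np.eye(len(R)) - GAMMA * P, R)
    return v[0]

def worst_case(theta, models):
    """Robust evaluation: the minimizing POMDP in the (finite) model set."""
    vals = [evaluate(theta, m) for m in models]
    i = int(np.argmin(vals))
    return i, vals[i]

def robust_ascent(models, steps=100, lr=0.5, eps=1e-4):
    """Alternate worst-case selection with an ascent step on that model.
    Central finite differences stand in for the paper's subgradient."""
    theta = np.zeros((2, 2))
    for _ in range(steps):
        i, _ = worst_case(theta, models)
        grad = np.zeros_like(theta)
        for idx in np.ndindex(*theta.shape):
            d = np.zeros_like(theta)
            d[idx] = eps
            grad[idx] = (evaluate(theta + d, models[i])
                         - evaluate(theta - d, models[i])) / (2 * eps)
        theta += lr * grad
    return theta
```

The design choice mirrored here is the key one from the summary: the worst-case model is re-identified inside every optimization step, so the ascent direction always targets the current robust bottleneck rather than a fixed nominal model.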