Risk-seeking conservative policy iteration with agent-state based policies for Dec-POMDPs with guaranteed convergence

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses decentralized partially observable Markov decision processes (Dec-POMDPs) under computational constraints by optimizing finite-memory policies defined over agent states. It combines an iterated best-response algorithm with a risk-seeking objective and a conservative policy iteration update, guaranteeing monotonic policy improvement and convergence to a local optimum with runtime polynomial in the Dec-POMDP model size. The authors present this as a novel way of incorporating memory constraints on the agents in the Dec-POMDP problem. Empirical evaluations on several benchmark Dec-POMDPs show near-optimal performance that matches state-of-the-art approaches despite the limited memory. Using more agent states (a larger memory) improves performance further.

📝 Abstract

Optimally solving decentralized decision-making problems modeled as Dec-POMDPs is known to be NEXP-complete. These optimal solutions are policies based on the entire history of observations and actions of an agent. However, some applications may require more compact policies because of limited compute capabilities, which can be modeled by considering a limited number of memory states (or agent states). While such an agent-state based policy class may not contain the optimal solution, it is still of practical interest to find the best agent-state policy within the class. We focus on an iterated best response style algorithm which guarantees monotonic improvements and convergence to a local optimum in polynomial runtime in the Dec-POMDP model size. In order to obtain a better local optimum, we use a modified objective which incentivizes risk-seeking alongside a conservative policy iteration update. Our empirical results show that our approach performs as well as state-of-the-art approaches on several benchmark Dec-POMDPs, achieving near-optimal performance while having polynomial runtime despite the limited memory. We also show that using more agent states (a larger memory) leads to greater performance. Our approach provides a novel way of incorporating memory constraints on the agents in the Dec-POMDP problem.
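To make the terms in the abstract concrete, here is a minimal sketch of an agent-state policy for a single agent, together with the standard conservative policy iteration mixture step. It is an illustration under stated assumptions, not the paper's implementation: the class name `AgentStatePolicy`, the function `conservative_update`, the deterministic memory update, and the step size `alpha` are all hypothetical, and the paper's risk-seeking objective and best-response computation are not reproduced here.

```python
import random


class AgentStatePolicy:
    """Finite-memory policy for one agent: it acts on a memory state, not the full history."""

    def __init__(self, action_probs, next_state, initial_state=0):
        # action_probs[z][a]: probability of taking action a in agent state z.
        # next_state[z][a][o]: agent state after taking a in z and observing o
        # (a deterministic memory update is an assumption of this sketch).
        self.action_probs = action_probs
        self.next_state = next_state
        self.state = initial_state

    def act(self, rng):
        # Sample an action from the distribution attached to the current agent state.
        probs = self.action_probs[self.state]
        return rng.choices(range(len(probs)), weights=probs)[0]

    def observe(self, action, observation):
        # Move to the next agent state using only local information.
        self.state = self.next_state[self.state][action][observation]


def conservative_update(old_probs, best_response_probs, alpha):
    """Mixture step: (1 - alpha) * old + alpha * best response, per agent state."""
    return [
        [(1.0 - alpha) * p_old + alpha * p_br for p_old, p_br in zip(row_old, row_br)]
        for row_old, row_br in zip(old_probs, best_response_probs)
    ]


# Two agent states, two actions, two observations (toy numbers, not from the paper).
old = [[0.5, 0.5], [0.5, 0.5]]
best_response = [[1.0, 0.0], [0.0, 1.0]]
mixed = conservative_update(old, best_response, alpha=0.2)

policy = AgentStatePolicy(mixed, next_state=[[[0, 1], [0, 1]], [[0, 1], [0, 1]]])
action = policy.act(random.Random(0))
policy.observe(action, observation=1)
```

In an iterated best-response loop, one agent's tables would be updated this way while the other agents' policies are held fixed. A small `alpha` keeps each step close to the previous policy, which is the "conservative" part of the update.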

Problem

Research questions and friction points this paper is trying to address.

Dec-POMDP

agent-state policy

memory-constrained planning

risk-seeking optimization

policy iteration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dec-POMDP

agent-state policy

conservative policy iteration