🤖 AI Summary
This work addresses the challenge of learning near-optimal policies in partially observable Markov decision processes (POMDPs) using only finite observation-action histories. The authors propose a superstate MDP framework that enables efficient model estimation from a single trajectory and computes near-optimal finite-window policies via value iteration. A key theoretical contribution is a novel connection between filter stability and concentration inequalities for weakly dependent random variables, which yields tight sample complexity guarantees for single-trajectory estimation of the superstate MDP. By integrating model-based reinforcement learning, superstate modeling, and analysis of non-independent sequences, the approach rigorously approximates high-performance policies in the original POMDP while maintaining strong theoretical foundations.
📝 Abstract
We study model-based learning of finite-window policies in tabular partially observable Markov decision processes (POMDPs). A common approach to learning under partial observability is to approximate unbounded history dependencies using finite action-observation windows. This induces a finite-state Markov decision process (MDP) over histories, referred to as the superstate MDP. Once a model of this superstate MDP is available, standard MDP algorithms can be used to compute optimal policies, motivating the need for sample-efficient model estimation. Estimating the superstate MDP model is challenging because trajectories are generated by interaction with the original POMDP, creating a mismatch between the sampling process and the target model. We propose a model estimation procedure for tabular POMDPs and analyze its sample complexity. Our analysis exploits a connection between filter stability and concentration inequalities for weakly dependent random variables. As a result, we obtain tight sample complexity guarantees for estimating the superstate MDP model from a single trajectory. Combined with value iteration, this yields approximately optimal finite-window policies for the POMDP.
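To make the pipeline in the abstract concrete, here is a minimal sketch (function names and data layout are my own, not the paper's code) of the two steps: estimating a superstate MDP by counting window-to-window transitions along a single trajectory, then running standard value iteration on the estimated tabular model. A superstate is the tuple of the last L (action, observation) pairs; the paper's actual estimator and its sample complexity analysis are more involved.

```python
from collections import Counter, defaultdict

def estimate_window_model(traj, L):
    """Estimate the superstate MDP from one trajectory.
    traj: list of (action, observation, reward) steps.
    A superstate is the tuple of the last L (action, observation) pairs."""
    counts = defaultdict(Counter)                 # (s, a) -> Counter over s'
    reward_sums = defaultdict(lambda: [0.0, 0])   # (s, a) -> [sum, count]
    for t in range(L - 1, len(traj) - 1):
        s = tuple((a, o) for a, o, _ in traj[t - L + 1 : t + 1])
        a_next, o_next, r_next = traj[t + 1]
        s_next = s[1:] + ((a_next, o_next),)      # slide the window forward
        counts[(s, a_next)][s_next] += 1
        reward_sums[(s, a_next)][0] += r_next
        reward_sums[(s, a_next)][1] += 1
    # Normalize counts into empirical transition probabilities and mean rewards.
    P = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
         for sa, c in counts.items()}
    R = {sa: tot / n for sa, (tot, n) in reward_sums.items()}
    return P, R

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Standard value iteration on the estimated tabular superstate model."""
    states = {s for (s, _) in P}
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(R[(s, a)] + gamma * sum(p * V.get(s2, 0.0)
                                                for s2, p in P[(s, a)].items())
                        for (s_, a) in P if s_ == s)
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```

The resulting greedy policy with respect to the estimated values is the finite-window policy; the paper's contribution is bounding how long a single trajectory must be for this estimate to yield a near-optimal policy.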