🤖 AI Summary
This work addresses the challenge of learning near-optimal policies in partially observable Markov decision processes (POMDPs) using only finite observation-action histories. The authors propose a superstate MDP framework that enables efficient model estimation from a single trajectory and computes near-optimal finite-window policies via value iteration. A key theoretical contribution is a novel connection between filter stability and concentration inequalities for weakly dependent random variables, which yields tight sample complexity guarantees for single-trajectory estimation of the superstate MDP. By integrating model-based reinforcement learning, superstate modeling, and analysis of non-independent sequences, the approach rigorously approximates high-performance policies in the original POMDP while maintaining strong theoretical foundations.
📝 Abstract
We study model-based learning of finite-window policies in tabular partially observable Markov decision processes (POMDPs). A common approach to learning under partial observability is to approximate unbounded history dependencies using finite action-observation windows. This induces a finite-state Markov decision process (MDP) over histories, referred to as the superstate MDP. Once a model of this superstate MDP is available, standard MDP algorithms can be used to compute optimal policies, motivating the need for sample-efficient model estimation. Estimating the superstate MDP model is challenging because trajectories are generated by interaction with the original POMDP, creating a mismatch between the sampling process and the target model. We propose a model estimation procedure for tabular POMDPs and analyze its sample complexity. Our analysis exploits a connection between filter stability and concentration inequalities for weakly dependent random variables. As a result, we obtain tight sample complexity guarantees for estimating the superstate MDP model from a single trajectory. Combined with value iteration, this yields approximately optimal finite-window policies for the POMDP.
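To make the pipeline in the abstract concrete, here is a minimal sketch (function names and data layout are my own, not the paper's code) of the two steps: estimating a superstate MDP by counting window-to-window transitions along a single trajectory, then running standard value iteration on the estimated tabular model. A superstate is the tuple of the last L (action, observation) pairs; the paper's actual estimator and its sample complexity analysis are more involved.

```python
from collections import Counter, defaultdict

def estimate_window_model(traj, L):
    """Estimate the superstate MDP from one trajectory.
    traj: list of (action, observation, reward) steps.
    A superstate is the tuple of the last L (action, observation) pairs."""
    counts = defaultdict(Counter)                 # (s, a) -> Counter over s'
    reward_sums = defaultdict(lambda: [0.0, 0])   # (s, a) -> [sum, count]
    for t in range(L - 1, len(traj) - 1):
        s = tuple((a, o) for a, o, _ in traj[t - L + 1 : t + 1])
        a_next, o_next, r_next = traj[t + 1]
        s_next = s[1:] + ((a_next, o_next),)      # slide the window forward
        counts[(s, a_next)][s_next] += 1
        reward_sums[(s, a_next)][0] += r_next
        reward_sums[(s, a_next)][1] += 1
    # Normalize counts into empirical transition probabilities and mean rewards.
    P = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
         for sa, c in counts.items()}
    R = {sa: tot / n for sa, (tot, n) in reward_sums.items()}
    return P, R

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Standard value iteration on the estimated tabular superstate model."""
    states = {s for (s, _) in P}
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(R[(s, a)] + gamma * sum(p * V.get(s2, 0.0)
                                                for s2, p in P[(s, a)].items())
                        for (s_, a) in P if s_ == s)
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```

The resulting greedy policy with respect to the estimated values is the finite-window policy; the paper's contribution is bounding how long a single trajectory must be for this estimate to yield a near-optimal policy.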