🤖 AI Summary
This work addresses the challenges of offline policy evaluation in partially observable Markov decision processes (POMDPs), where reliance on full histories leads to the curse of dimensionality and exponentially growing estimation errors. The authors propose a novel coverage analysis framework grounded in belief-space metrics, introducing the intrinsic geometric structure of the belief space into offline POMDP learning for the first time. By leveraging the Lipschitz continuity of value-relevant functions over the belief space, the framework replaces traditional history-based coverage with belief coverage, substantially weakening coverage assumptions. This approach unifies and strengthens theoretical analyses across multiple algorithmic classes, yielding tighter error bounds and improved sample efficiency for both two-sample Bellman error minimization and memory-based future-dependent value functions (FDVFs), thereby effectively mitigating error amplification caused by long horizons and extended memory lengths.
📝 Abstract
In off-policy evaluation (OPE) for partially observable Markov decision processes (POMDPs), an agent must infer hidden states from past observations, which exacerbates both the curse of horizon and the curse of memory in existing OPE methods. This paper introduces a novel coverage analysis framework that exploits the intrinsic metric structure of the belief space (distributions over latent states) to relax traditional coverage assumptions. By assuming value-relevant functions are Lipschitz continuous over the belief space, we derive error bounds that mitigate exponential blow-ups in horizon and memory length. Our unified analysis technique applies to a broad class of OPE algorithms, yielding concrete error bounds and coverage requirements expressed in terms of belief-space metrics rather than raw history coverage. We illustrate the improved sample efficiency of this framework via two case studies: the double-sampling Bellman error minimization algorithm, and memory-based future-dependent value functions (FDVFs). In both cases, our coverage definition based on the belief-space metric yields tighter bounds.
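To make the abstract's central objects concrete, here is a minimal sketch (not from the paper; all matrices and numbers are illustrative assumptions) of a belief in a toy two-state POMDP: a Bayes-filter update maps an observation history to a distribution over latent states, and an L1 distance on those distributions plays the role of the belief-space metric. The point of the paper's coverage notion is that two distinct histories can induce nearby beliefs, so a Lipschitz value function assigns them nearby values even when the histories themselves are never jointly covered by the data.

```python
def belief_update(belief, T, O, obs):
    """One Bayes filter step: predict with transition matrix T,
    then correct with observation likelihoods O[s][obs], then normalize."""
    n = len(belief)
    predicted = [sum(belief[s] * T[s][s2] for s in range(n)) for s2 in range(n)]
    unnorm = [predicted[s2] * O[s2][obs] for s2 in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def belief_distance(b1, b2):
    """L1 distance between beliefs: a simple stand-in for the
    belief-space metric; coverage is measured over beliefs, not histories."""
    return sum(abs(x - y) for x, y in zip(b1, b2))

# Toy two-state POMDP (hypothetical numbers).
T = [[0.9, 0.1], [0.2, 0.8]]   # transition probabilities T[s][s']
O = [[0.8, 0.2], [0.3, 0.7]]   # observation likelihoods P(obs | state)

# Two different "histories": two observations from a uniform prior,
# versus one observation from a slightly different prior.
b_a = belief_update(belief_update([0.5, 0.5], T, O, 0), T, O, 0)
b_b = belief_update([0.6, 0.4], T, O, 0)

# The induced beliefs are close in the metric; a Lipschitz value-relevant
# function f then satisfies |f(b_a) - f(b_b)| <= L * belief_distance(b_a, b_b).
print(round(belief_distance(b_a, b_b), 4))
```

Under such Lipschitz continuity, data covering a neighborhood of `b_a` also informs the value at `b_b`, which is why belief coverage is a strictly weaker requirement than history coverage when many histories collapse to nearby beliefs.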