🤖 AI Summary
This paper addresses the online approximate solution of large-scale partially observable Markov decision processes (POMDPs) in dynamically evolving environments. Methodologically, it introduces a novel anytime online planning algorithm that samples meaningful future histories deeply while forcing a gradual policy update, operating within a reference policy programming framework that circumvents explicit numerical optimization. Theoretically, it establishes a performance loss bound expressed in terms of the *mean* sampling approximation error rather than the conventional maximum, a crucial property given the sampling sparsity of online planning. Empirical evaluation demonstrates that the algorithm considerably outperforms state-of-the-art online POMDP solvers on large-scale dynamic tasks, including a helicopter emergency scenario in the Corsica region requiring approximately 150 planning steps.
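To make the planning loop concrete, below is a minimal Python sketch of the kind of scheme described above: sample deep future histories under a reference policy, then make a gradual, regularised policy update rather than jumping to the greedy action. Everything here is an illustrative assumption, not the paper's actual algorithm or interface: the generative model `simulate_step`, the planner `plan`, and the exponentiated-update step size `eta` are all hypothetical names.

```python
import math
import random
from collections import defaultdict


def plan(belief_particles, actions, simulate_step,
         depth=150, iterations=1000, gamma=0.99, eta=0.05):
    """Anytime online planning sketch: deep history sampling + gradual update.

    belief_particles: sampled states representing the current belief.
    simulate_step:    generative model, (state, action) -> (next_state, reward).
    eta:              step size of the gradual policy update.
    """
    ref_policy = {a: 1.0 / len(actions) for a in actions}  # start uniform
    q_hat = defaultdict(float)   # running Monte Carlo Q estimates
    n = defaultdict(int)         # visit counts per first action

    for _ in range(iterations):          # anytime: stop whenever time runs out
        state = random.choice(belief_particles)
        first = _sample(ref_policy)      # stay close to the reference policy
        ret, disc, s, a = 0.0, 1.0, state, first
        for _ in range(depth):           # one *deep* future history
            s, r = simulate_step(s, a)
            ret += disc * r
            disc *= gamma
            a = _sample(ref_policy)      # open-loop here for brevity; a real
                                         # solver would branch on observations
        n[first] += 1
        q_hat[first] += (ret - q_hat[first]) / n[first]

        # Gradual update: exponentiated re-weighting toward higher-value
        # actions (mirror-descent flavour), not a jump to the greedy action.
        q_max = max(q_hat[a_] for a_ in actions)  # for numerical stability
        w = {a_: ref_policy[a_] * math.exp(eta * (q_hat[a_] - q_max))
             for a_ in actions}
        z = sum(w.values())
        ref_policy = {a_: w[a_] / z for a_ in actions}

    return ref_policy


def _sample(dist):
    """Draw a key from a {key: probability} dict."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if acc >= r:
            return k
    return k
```

The gradual, exponentiated re-weighting keeps the sampling policy close to the reference policy between iterations, which is the kind of property an average-error analysis can exploit; a full solver would additionally condition rollouts on observations rather than rolling out open-loop as this sketch does.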
📝 Abstract
This paper proposes Partially Observable Reference Policy Programming, a novel anytime online approximate POMDP solver that samples meaningful future histories very deeply while simultaneously forcing a gradual policy update. We provide theoretical guarantees for the algorithm's underlying scheme, showing that the performance loss is bounded by the average of the sampling approximation errors rather than the usual maximum -- a crucial requirement given the sampling sparsity of online planning. Empirical evaluations on two large-scale problems with dynamically evolving environments -- including a helicopter emergency scenario in the Corsica region requiring approximately 150 planning steps -- corroborate the theoretical results and indicate that our solver considerably outperforms current online benchmarks.
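The contrast between the average-error and maximum-error guarantees can be written schematically as follows. The notation is an illustration only, not the paper's exact theorem statement: per-iteration sampling approximation errors $\epsilon_1, \dots, \epsilon_N$, a problem-dependent constant $C$, and initial belief $b_0$ are all assumed symbols.

```latex
% Schematic contrast only -- notation assumed, not the paper's exact theorem.
% \epsilon_1, ..., \epsilon_N: per-iteration sampling approximation errors;
% C: a problem-dependent constant; b_0: the initial belief.
\[
  V^{\pi^\ast}(b_0) - V^{\hat{\pi}}(b_0)
    \;\le\; \frac{C}{N} \sum_{i=1}^{N} \epsilon_i
  \qquad \text{(mean-error bound)}
\]
% versus the more common worst-case form
\[
  V^{\pi^\ast}(b_0) - V^{\hat{\pi}}(b_0)
    \;\le\; C \max_{1 \le i \le N} \epsilon_i
  \qquad \text{(max-error bound).}
\]
% Under sparse sampling, a handful of badly approximated histories can blow
% up the maximum, while the average remains controlled.
```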