🤖 AI Summary
This work addresses the limited theoretical characterization of policy convergence dynamics in existing reinforcement learning methods for reachability specifications. The authors propose an approach grounded in the Probably Approximately Correct (PAC) learning framework. It iteratively estimates unknown parameters of the Markov decision process (MDP), such as the minimum transition probability, so that the PAC conditions are satisfied with increasing accuracy and the learned policy converges to an exactly optimal one in the limit. At each stage the method yields a near-optimal policy with high confidence in finite time, and the sequence of stages gives insight into how convergence proceeds. Empirical evaluations on standard benchmarks support the theoretical predictions about the convergence dynamics.
📝 Abstract
Reinforcement learning (RL) for reachability specifications is fundamental in sequential decision-making, yet its theoretical guarantees remain underexplored. Recent work achieves asymptotic convergence to optimal policies, but that approach provides limited insight into convergence dynamics. In this work, we present an alternative approach that provides deeper theoretical insights into convergence. Our approach builds on PAC learning under assumptions: PAC learning guarantees near-optimal policies with high confidence in finite time, but requires knowledge of internal MDP parameters such as the minimum transition probability. We argue that, although these parameters are unknown in RL, they can be estimated and iteratively refined with increasing accuracy. By iteratively satisfying the PAC conditions, we show that exact optimality can be achieved in the limit. Empirical evaluations on standard benchmarks validate our theoretical insights into convergence dynamics.
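The idea of iteratively refining an unknown parameter can be illustrated with a minimal sketch. It is not the authors' algorithm: the halving schedule for the guessed minimum transition probability, the Hoeffding-style sample count, the toy three-state MDP, and all function names (`samples_needed`, `toy_step`, `refine`, and so on) are assumptions made purely for illustration.

```python
import math
import random


def samples_needed(eps, delta):
    # Hoeffding bound: this many samples keep one empirical probability
    # within eps of the truth with probability at least 1 - delta.
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def toy_step(state, action, rng):
    # Hypothetical MDP: 0 = start, 1 = goal (absorbing), 2 = sink (absorbing).
    if state != 0:
        return state
    p_goal = 0.9 if action == "a" else 0.5
    return 1 if rng.random() < p_goal else 2


def estimate_transitions(step, states, actions, n, rng):
    # Empirical transition distribution from n samples per state-action pair.
    est = {}
    for s in states:
        for a in actions:
            counts = {}
            for _ in range(n):
                t = step(s, a, rng)
                counts[t] = counts.get(t, 0) + 1
            est[(s, a)] = {t: c / n for t, c in counts.items()}
    return est


def q_value(est, v, s, a):
    return sum(p * v[t] for t, p in est[(s, a)].items())


def solve_reachability(est, states, actions, goal, sweeps=100):
    # Value iteration for the maximal probability of reaching the goal set.
    v = {s: (1.0 if s in goal else 0.0) for s in states}
    for _ in range(sweeps):
        v = {s: 1.0 if s in goal
             else max(q_value(est, v, s, a) for a in actions)
             for s in states}
    policy = {s: max(actions, key=lambda a: q_value(est, v, s, a))
              for s in states}
    return v, policy


def refine(step, states, actions, goal, rounds=3, delta=0.05, seed=0):
    # Each round halves the guessed minimum transition probability and
    # tightens the accuracy target accordingly (assumed schedule).
    # Simplification: delta is applied per estimate, without a union bound.
    rng = random.Random(seed)
    history = []
    for k in range(1, rounds + 1):
        p_min_guess = 2.0 ** -k
        n = samples_needed(p_min_guess / 2.0, delta)
        est = estimate_transitions(step, states, actions, n, rng)
        v, policy = solve_reachability(est, states, actions, goal)
        history.append((p_min_guess, n, v, policy))
    return history


if __name__ == "__main__":
    for p_min_guess, n, v, policy in refine(toy_step, [0, 1, 2], ["a", "b"], {1}):
        print(p_min_guess, n, round(v[0], 3), policy[0])
```

Each round uses a smaller guess and therefore more samples, so the estimated reachability value at the start state should approach the true optimum (0.9 in this toy model) as the rounds progress. This mirrors, in a very simplified form, the idea of satisfying PAC conditions with increasing accuracy.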