🤖 AI Summary
This paper studies reinforcement learning for non-stationary Markov decision processes (MDPs) under the infinite-horizon average-reward criterion, where both the reward function and the transition dynamics evolve over time subject to a total variation budget $\Delta_T$. To address the lack of theoretical guarantees for policy gradient methods in this setting, we propose the first model-free non-stationary natural actor-critic algorithm, NS-NAC, equipped with a restart-based exploration mechanism. We further design BORL-NS-NAC, a parameter-free variant that combines a bandit-over-RL meta-learning layer with a Lyapunov-function-based dynamic analysis, eliminating the need for prior knowledge of $\Delta_T$. Under the dynamic regret metric, both algorithms achieve an upper bound of $\tilde{O}(|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6})$, establishing the first class of non-stationary policy gradient methods with rigorous theoretical guarantees.
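To make the restart mechanism concrete, below is a minimal Python sketch of a restart-based natural actor-critic loop in the average-reward setting. The environment interface `env.step(s, a) -> (s_next, r)`, the step sizes, and the epoch length `restart_len` (standing in for the schedule the paper tunes from $\Delta_T$) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def ns_nac_sketch(env, S, A, T, restart_len, alpha=0.01, beta=0.05, eta=0.01, seed=0):
    """Sketch of a restart-based natural actor-critic loop (NS-NAC-style)
    for the average-reward setting. All names and step sizes are
    illustrative; `env.step(s, a) -> (s_next, r)` is an assumed interface
    to the drifting MDP."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((S, A))    # softmax policy parameters
    q = np.zeros((S, A))        # critic: differential Q-value estimates
    rho = 0.0                   # running average-reward estimate
    s = 0
    for t in range(T):
        if t % restart_len == 0:    # restart: discard stale estimates so
            theta[:] = 0.0          # the agent re-explores after possible
            q[:] = 0.0              # environment drift
            rho = 0.0
        pi_s = softmax(theta[s])
        a = rng.choice(A, p=pi_s)
        s_next, r = env.step(s, a)
        a_next = rng.choice(A, p=softmax(theta[s_next]))
        # critic: average-reward TD(0) on the differential Q-function
        td = r - rho + q[s_next, a_next] - q[s, a]
        q[s, a] += beta * td
        rho += eta * (r - rho)
        # actor: for tabular softmax policies the natural policy gradient
        # step reduces to adding the advantage estimate
        theta[s] += alpha * (q[s] - pi_s @ q[s])
        s = s_next
    return theta
```

The periodic reset is the key non-stationarity device: by forgetting the critic and policy every `restart_len` steps, the learner bounds how long it can track an outdated environment, trading tracking error against re-learning cost.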
📄 Abstract
We consider the problem of non-stationary reinforcement learning (RL) in the infinite-horizon average-reward setting. We model it by a Markov decision process with time-varying rewards and transition probabilities, with a variation budget of $\Delta_T$. Existing non-stationary RL algorithms focus on model-based and model-free value-based methods. Policy-based methods, despite their flexibility in practice, are not theoretically well understood in non-stationary RL. We propose and analyze the first model-free policy-based algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient method with restart-based exploration for change and a novel interpretation of learning rates as adapting factors. Further, we present a bandit-over-RL based parameter-free algorithm, BORL-NS-NAC, that does not require prior knowledge of the variation budget $\Delta_T$. We establish a dynamic regret of $\tilde{\mathscr{O}}(|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6})$ for both algorithms, where $T$ is the time horizon and $|S|$, $|A|$ are the sizes of the state and action spaces. The regret analysis leverages a novel adaptation of the Lyapunov function analysis of NAC to dynamic environments and characterizes the effects of simultaneous updates in the policy, the value function estimate, and changes in the environment.
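The bandit-over-RL layer can be sketched in the same spirit: an adversarial bandit (EXP3-style here) selects a restart length per block of interactions and treats the block's average reward as bandit feedback, so no prior knowledge of $\Delta_T$ is needed. The candidate grid, block length, reward normalization, and the `run_block` helper are hypothetical choices for illustration, not the paper's exact construction.

```python
import numpy as np

def borl_ns_nac_sketch(run_block, T, block_len, candidate_restarts, seed=0):
    """Sketch of the bandit-over-RL idea behind BORL-NS-NAC.

    `run_block(restart_len, block_len)` is an assumed (hypothetical)
    callable that runs the base NS-NAC learner for one block, e.g. a thin
    wrapper around ns_nac_sketch above, and returns its average reward,
    normalized to [0, 1]."""
    rng = np.random.default_rng(seed)
    K = len(candidate_restarts)
    n_blocks = T // block_len
    lr = np.sqrt(np.log(K) / (n_blocks * K))              # EXP3 learning rate
    gamma = min(1.0, np.sqrt(K * np.log(K) / n_blocks))   # exploration mix
    log_w = np.zeros(K)                                   # log-weights per arm
    for _ in range(n_blocks):
        w = np.exp(log_w - log_w.max())
        p = (1 - gamma) * w / w.sum() + gamma / K
        k = rng.choice(K, p=p)
        # one block of the base learner with the sampled restart length;
        # its average reward is the bandit feedback for arm k
        reward = run_block(candidate_restarts[k], block_len)
        log_w[k] += lr * reward / p[k]    # importance-weighted EXP3 update
    return log_w
```

Because the environment drifts adversarially from the meta-learner's point of view, an adversarial bandit such as EXP3 is the natural choice over a stochastic one: it competes with the best fixed restart length in hindsight without assuming the block rewards are i.i.d.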