Natural Policy Gradient for Average Reward Non-Stationary RL

📅 2025-04-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper studies reinforcement learning for non-stationary Markov decision processes (MDPs) under infinite-horizon average reward, where both the reward function and transition dynamics evolve over time, subject to a total variation budget $\Delta_T$. To address the lack of theoretical guarantees in existing policy gradient methods, the authors propose the first model-free non-stationary natural actor-critic algorithm, NS-NAC, equipped with a restart-based exploration mechanism. They further design BORL-NS-NAC, a parameter-free variant that integrates the bandit-over-RL meta-learning framework with a Lyapunov-function-based dynamic analysis, eliminating the need for prior knowledge of $\Delta_T$. Under the dynamic regret metric, both algorithms achieve an upper bound of $\tilde{O}(|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6})$, establishing the first class of non-stationary policy gradient methods with rigorous theoretical guarantees.

πŸ“ Abstract
We consider the problem of non-stationary reinforcement learning (RL) in the infinite-horizon average-reward setting. We model it by a Markov Decision Process with time-varying rewards and transition probabilities, with a variation budget of $\Delta_T$. Existing non-stationary RL algorithms focus on model-based and model-free value-based methods. Policy-based methods, despite their flexibility in practice, are not theoretically well understood in non-stationary RL. We propose and analyze the first model-free policy-based algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient method with restart-based exploration for change and a novel interpretation of learning rates as adapting factors. Further, we present a bandit-over-RL based parameter-free algorithm, BORL-NS-NAC, that does not require prior knowledge of the variation budget $\Delta_T$. We present a dynamic regret of $\tilde{\mathscr{O}}(|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6})$ for both algorithms, where $T$ is the time horizon, and $|S|$, $|A|$ are the sizes of the state and action spaces. The regret analysis leverages a novel adaptation of the Lyapunov function analysis of NAC to dynamic environments and characterizes the effects of simultaneous updates in policy, value function estimate, and changes in the environment.
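To make the restart idea concrete, here is a minimal sketch of a restart-based average-reward actor-critic loop on a toy non-stationary MDP. The MDP instance, step sizes, and restart period are all illustrative assumptions, not taken from the paper; the sketch only shows the two ingredients the abstract names: periodic restarts that forget stale estimates, and a tabular softmax actor whose natural-gradient step reduces to adding the TD advantage to the logits.

```python
import numpy as np

def ns_nac_sketch(T=3000, restart_period=500, seed=0):
    """Restart-based natural actor-critic sketch on a hypothetical
    2-state, 2-action MDP whose rewards drift over time."""
    rng = np.random.default_rng(seed)
    nS, nA = 2, 2
    # Fixed transitions; only the rewards are non-stationary here.
    P = np.array([[[0.9, 0.1], [0.1, 0.9]],
                  [[0.8, 0.2], [0.2, 0.8]]])  # P[s, a, s']
    theta = np.zeros((nS, nA))   # softmax policy parameters (actor)
    V = np.zeros(nS)             # differential value estimates (critic)
    avg_r = 0.0                  # running average-reward estimate
    alpha, beta, eta = 0.05, 0.01, 0.01  # critic / actor / avg-reward steps
    s = 0
    rewards = []
    for t in range(T):
        if t % restart_period == 0:
            # Restart-based exploration: drop estimates that may be stale
            # because the environment has drifted since they were learned.
            theta[:] = 0.0
            V[:] = 0.0
            avg_r = 0.0
        # Time-varying reward table (slow drift, illustrative only).
        r_table = np.array([[1.0, 0.0], [0.0, 1.0]]) * np.sin(2 * np.pi * t / T)
        logits = theta[s] - theta[s].max()
        probs = np.exp(logits) / np.exp(logits).sum()
        a = rng.choice(nA, p=probs)
        r = r_table[s, a]
        s_next = rng.choice(nS, p=P[s, a])
        # Average-reward TD error drives critic, actor, and reward tracker.
        delta = r - avg_r + V[s_next] - V[s]
        avg_r += eta * delta
        V[s] += alpha * delta
        # For a tabular softmax policy, the natural policy gradient step
        # amounts to adding the advantage estimate directly to the logits.
        theta[s, a] += beta * delta
        rewards.append(r)
        s = s_next
    return np.mean(rewards)
```

With per-step rewards bounded in [-1, 1], the returned average reward is bounded the same way; the restart period trades off forgetting (too short) against tracking stale optima (too long), which is exactly the tuning that the variation budget governs.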
Problem

Research questions and friction points this paper is trying to address.

Addressing non-stationary RL in average-reward settings
Proposing policy-based methods for dynamic environments
Achieving dynamic regret bounds without variation budget knowledge
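For context, the dynamic regret metric used in this line of work is typically defined against the per-step optimal average reward (a standard background definition, stated here for orientation rather than quoted from the paper):

```latex
\text{Dyn-Reg}(T) \;=\; \sum_{t=1}^{T} \Bigl( J^{*}_{t} \,-\, r_{t}(s_t, a_t) \Bigr),
```

where $J^{*}_{t}$ is the optimal average reward of the MDP active at step $t$ and $r_t(s_t, a_t)$ is the reward the learner collects. Because the comparator re-optimizes at every step, sublinear dynamic regret is only achievable when the total variation $\Delta_T$ is itself sublinear in $T$.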
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-free policy-based algorithm NS-NAC
Restart-based exploration for non-stationarity
Bandit-over-RL parameter-free BORL-NS-NAC
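The bandit-over-RL idea in BORL-NS-NAC can be sketched as a meta-learner that treats candidate restart periods as bandit arms and selects among them with an EXP3-style rule, so that no prior knowledge of the variation budget is needed. Everything below (the arm set, exploration rate, and the placeholder payoff) is an illustrative assumption, not the paper's actual construction; the real algorithm would run the base RL learner for an epoch under the chosen period and feed back its observed reward.

```python
import numpy as np

def borl_restart_selection(n_epochs=50, seed=0):
    """EXP3-style meta-selection over candidate restart periods
    (hypothetical sketch of the bandit-over-RL layer)."""
    rng = np.random.default_rng(seed)
    periods = [100, 200, 400, 800]  # candidate restart periods (arms)
    K = len(periods)
    weights = np.ones(K)
    gamma = 0.2                     # exploration rate
    for _ in range(n_epochs):
        probs = (1 - gamma) * weights / weights.sum() + gamma / K
        arm = rng.choice(K, p=probs)
        # Placeholder payoff in [0, 1]: in the real algorithm this would be
        # the normalized reward of one epoch of the base learner (e.g.
        # NS-NAC) run with restart period periods[arm].
        payoff = rng.uniform(0.0, 1.0)
        # EXP3 importance-weighted exponential update.
        weights[arm] *= np.exp(gamma * payoff / (probs[arm] * K))
    return periods[int(np.argmax(weights))]
```

The forced-exploration term `gamma / K` keeps every arm's selection probability bounded below, which bounds the importance weights and is what makes the meta-learner's own regret controllable.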
🔎 Similar Papers
No similar papers found.