Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems

📅 2023-12-05
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
This paper addresses average-reward reinforcement learning for countable-state Markov decision processes (MDPs) in which some policies may be unstable and the stationary distribution belongs to an exponential family parametrized by the policy parameters. Method: The authors propose the Score-Aware Gradient Estimator (SAGE), a value-function-free policy-gradient estimator that directly exploits the exponential-family structure of the stationary distribution, bypassing the value-function approximation that actor-critic methods rely on. Contribution/Results: Theoretically, even when the objective is nonconvex with multiple maximizers and the state space is infinite, local convergence guarantees are established under a local Lyapunov condition and nondegeneracy of the Hessian at a maximizer: initialized sufficiently close to a maximizer, stochastic gradient ascent with SAGE converges to the associated optimal policy with high probability. Empirically, on multi-class product-form stochastic networks and queueing systems, SAGE reaches near-optimal policies faster than a standard actor-critic baseline, supporting both the theory and the method's practical efficacy.
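To make the key structural idea concrete, here is a minimal finite-state sketch (not the paper's implementation; all names, the toy state space, and the feature/reward choices are assumptions for illustration): when the stationary distribution lies in an exponential family, pi_theta(x) ∝ rho(x) exp(theta · phi(x)), the gradient of the average reward J(theta) = E_pi[r(X)] reduces to a covariance, Cov_pi(phi(X), r(X)), so no value function is needed.

```python
import numpy as np

# Toy sketch of the score-aware idea on a finite state space (assumed setup,
# not the paper's code). Stationary distribution in an exponential family:
#   pi_theta(x) ∝ rho(x) * exp(theta · phi(x)),
# which gives the value-function-free identity
#   grad J(theta) = Cov_pi(phi(X), r(X)).

rng = np.random.default_rng(0)
n_states, d = 6, 2
phi = rng.normal(size=(n_states, d))  # sufficient statistics (toy choice)
rho = np.ones(n_states)               # base measure
r = rng.normal(size=n_states)         # per-state reward

def stationary(theta):
    w = rho * np.exp(phi @ theta)
    return w / w.sum()

def avg_reward(theta):
    return stationary(theta) @ r

def score_aware_gradient(theta):
    """Covariance of phi(X) and r(X) under pi_theta = exact gradient of J."""
    pi = stationary(theta)
    centered_phi = phi - pi @ phi         # phi(x) - E_pi[phi]
    return centered_phi.T @ (pi * (r - pi @ r))

theta = np.array([0.3, -0.5])
g = score_aware_gradient(theta)

# Sanity check against central finite differences of J.
eps = 1e-6
fd = np.array([(avg_reward(theta + eps * e) - avg_reward(theta - eps * e)) / (2 * eps)
               for e in np.eye(d)])
```

The identity follows from grad log pi_theta(x) = phi(x) - E_pi[phi], the score of the exponential family, which is what makes a score-aware estimator possible without critic learning.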
📝 Abstract
In this paper, we introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distributions commonly obtained from Markov decision processes (MDPs) in stochastic networks, queueing systems, and statistical mechanics. Specifically, when the stationary distribution of the MDP belongs to an exponential family that is parametrized by policy parameters, we can improve existing policy gradient methods for average-reward RL. Our key identification is a family of gradient estimators, called score-aware gradient estimators (SAGEs), that enable policy gradient estimation without relying on value-function approximation in the aforementioned setting. This contrasts with other common policy-gradient algorithms such as actor-critic methods. We first show that policy-gradient with SAGE locally converges, including in cases when the objective function is nonconvex, presents multiple maximizers, and the state space of the MDP is not finite. Under appropriate assumptions such as starting sufficiently close to a maximizer, the policy under stochastic gradient ascent with SAGE has an overwhelming probability of converging to the associated optimal policy. Other key assumptions are that a local Lyapunov function exists, and a nondegeneracy property of the Hessian of the objective function holds locally around a maximizer. Furthermore, we conduct a numerical comparison between a SAGE-based policy-gradient method and an actor-critic method. We specifically focus on several examples inspired from stochastic networks, queueing systems, and models derived from statistical physics, where parametrizable exponential families are commonplace. Our results demonstrate that a SAGE-based method finds close-to-optimal policies faster than an actor-critic method.
Problem

Research questions and friction points this paper is trying to address.

Improving policy-gradient methods for model-based reinforcement learning
Estimating gradients without value-function approximation in MDPs
Ensuring policy convergence using local Lyapunov stability analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Score-aware gradient estimators (SAGEs) replace value-function approximation
Policy-gradient method exploits exponential-family stationary distributions of MDPs
Local Lyapunov conditions and Hessian nondegeneracy yield high-probability local convergence to optimal policies
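The combination of these ideas can be sketched as a stochastic-gradient-ascent loop in which the gradient is estimated from samples via the score-aware covariance identity. This is a hedged toy illustration, not the paper's algorithm: it assumes a small finite exponential family, i.i.d. draws from pi_theta standing in for trajectory samples, and a fixed step size.

```python
import numpy as np

# Toy stochastic gradient ascent with a sampled score-aware estimate
# (illustrative assumptions throughout; not the paper's implementation).

rng = np.random.default_rng(1)
n_states, d = 6, 2
phi = rng.normal(size=(n_states, d))  # sufficient statistics (toy choice)
r = rng.normal(size=n_states)         # per-state reward

def stationary(theta):
    w = np.exp(phi @ theta)
    return w / w.sum()

def avg_reward(theta):
    return stationary(theta) @ r

theta = np.zeros(d)
j_start = avg_reward(theta)
for _ in range(300):
    pi = stationary(theta)
    xs = rng.choice(n_states, size=512, p=pi)  # sampled states
    f, rew = phi[xs], r[xs]
    # Empirical covariance of phi(X) and r(X): the score-aware estimate.
    g_hat = (f - f.mean(axis=0)).T @ (rew - rew.mean()) / len(xs)
    theta += 0.3 * g_hat                        # ascent step
j_end = avg_reward(theta)
```

Since the average reward is a convex combination of the per-state rewards, it stays bounded by max r; the loop should climb toward a local maximizer, mirroring the local convergence regime the paper analyzes.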