Efficiently Training Deep-Learning Parametric Policies using Lagrangian Duality

πŸ“… 2024-05-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the low sample efficiency and difficulty of ensuring feasibility when training policies for constrained Markov decision processes (CMDPs) in high-stakes settings, this paper proposes Two-Stage Deep Decision Rules (TS-DDR). TS-DDR is presented as the first method to integrate Lagrangian duality into an end-to-end policy-training framework: its forward pass solves a deterministic constrained optimization subproblem to obtain feasible decisions, and its backward pass backpropagates closed-form dual gradients. By combining stochastic gradient descent, deterministic optimization solvers, and neural-network policy parameterization, it avoids the loss of optimality induced by convex relaxations. Evaluated on a real-world hydrothermal dispatch task using actual power system data from Bolivia, TS-DDR achieves significantly higher solution quality while reducing computation time by one to two orders of magnitude compared to state-of-the-art methods.
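The closed-form dual gradient idea can be illustrated on a toy one-dimensional convex subproblem. This is a hedged sketch, not the paper's LTHD formulation: by the envelope theorem, the derivative of the subproblem's optimal value with respect to a parameter in the constraint equals the Lagrange multiplier of that constraint, so no differentiation through the solver is needed.

```python
import numpy as np

def solve_second_stage(theta):
    """Toy subproblem: min_x x^2  s.t.  x >= theta.

    For theta > 0 the constraint is active, so x* = theta and the
    dual (Lagrange multiplier) of `x >= theta` is lambda* = 2*theta.
    """
    x_star = max(theta, 0.0)
    lam = 2.0 * x_star       # closed-form dual at the optimum
    value = x_star ** 2      # optimal objective value
    return value, lam

theta = 1.5
value, lam = solve_second_stage(theta)

# Envelope theorem check: d(value)/d(theta) equals the dual multiplier.
eps = 1e-6
fd = (solve_second_stage(theta + eps)[0]
      - solve_second_stage(theta - eps)[0]) / (2 * eps)
print(lam, fd)  # both approximately 3.0
```

This is the mechanism the summary describes: the forward pass solves the subproblem to optimality, and the dual variables it returns serve directly as gradients for the backward pass.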

πŸ“ Abstract
Constrained Markov Decision Processes (CMDPs) are critical in many high-stakes applications, where decisions must optimize cumulative rewards while strictly adhering to complex nonlinear constraints. In domains such as power systems, finance, supply chains, and precision robotics, violating these constraints can result in significant financial or societal costs. Existing Reinforcement Learning (RL) methods often struggle with sample efficiency and effectiveness in finding feasible policies for highly and strictly constrained CMDPs, limiting their applicability in these environments. Stochastic dual dynamic programming is often used in practice on convex relaxations of the original problem, but it also encounters computational challenges and loss of optimality. This paper introduces a novel approach, Two-Stage Deep Decision Rules (TS-DDR), to efficiently train parametric actor policies using Lagrangian Duality. TS-DDR is a self-supervised learning algorithm that trains general decision rules (parametric policies) using stochastic gradient descent (SGD); its forward passes solve *deterministic* optimization problems to find feasible policies, and its backward passes leverage duality theory to train the parametric policy with closed-form gradients. TS-DDR inherits the flexibility and computational performance of deep learning methodologies to solve CMDP problems. Applied to the Long-Term Hydrothermal Dispatch (LTHD) problem using actual power system data from Bolivia, TS-DDR is shown to enhance solution quality and to reduce computation times by several orders of magnitude when compared to current state-of-the-art methods.
Problem

Research questions and friction points this paper is trying to address.

Optimizing cumulative rewards in CMDPs while strictly satisfying nonlinear constraints.
Improving the sample efficiency of policy training under strict constraints.
Reducing computation time in power-system applications such as hydrothermal dispatch.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Lagrangian duality to obtain closed-form gradients in the backward pass.
Implements Two-Stage Deep Decision Rules (TS-DDR), whose forward pass solves deterministic optimization problems.
Trains the parametric policy with stochastic gradient descent.
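The three innovations above can be combined into a minimal self-supervised training loop. This sketch uses an illustrative one-dimensional subproblem and a linear decision rule (`w`, `second_stage`, and the sampling scheme are assumptions, not the paper's model): the forward pass solves a deterministic subproblem, and the SGD update uses its dual variable as the closed-form gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear decision rule (parametric policy): first-stage decision u = w * s.
w = 2.0

def second_stage(u):
    """Toy subproblem min_x x^2 s.t. x >= u; returns optimal value and dual."""
    x = max(u, 0.0)
    return x ** 2, 2.0 * x

lr = 0.05
for step in range(200):
    s = rng.uniform(0.5, 1.5)     # sampled state / scenario
    u = w * s                     # forward pass: policy proposes a decision
    cost, lam = second_stage(u)   # forward pass: solve deterministic subproblem
    grad_w = lam * s              # backward pass: chain rule via the dual (d cost / d u = lam)
    w -= lr * grad_w              # SGD update on the policy parameter

print(abs(w))  # driven toward 0, the unconstrained minimizer of the toy cost
```

No sampling of actions and no reward estimation is involved: the loop is self-supervised in the sense that the subproblem's own optimal value and duals supply the training signal.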
πŸ”Ž Similar Papers