Deontically Constrained Policy Improvement in Reinforcement Learning Agents

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of learning reinforcement learning policies that simultaneously maximize utility and satisfy deontic constraints (e.g., ethical norms). To overcome a limitation of standard MDPs, which lack formal semantics for "ought-to-do" or "forbidden" statements, the paper interprets the logic of Expected Act Utilitarianism, a probabilistic stit logic with deontic operators, over controlled MDPs, enabling verifiable constraint modeling. It further introduces a constraint-aware, bi-level policy improvement algorithm that jointly optimizes mission utility and enforces the specified obligations. Theoretically, the algorithm is shown to reach a constrained local maximum of the mission utility while satisfying the deontic constraints. Empirical evaluation on sample MDPs demonstrates that the method avoids prohibited actions while retaining task performance close to that of unconstrained baselines.
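
For a concrete picture of the constrained improvement step, here is a minimal sketch in the spirit of the summary above: tabular policy iteration whose greedy step ranges only over deontically permitted actions. It assumes a transition model `P[s][a]` given as a list of `(prob, next_state, reward)` triples and a predicate `permitted(s, a)`; all of these names are illustrative assumptions, not the paper's API, and the paper's logic-based constraint semantics is richer than a per-state permitted set.

```python
import numpy as np

def policy_evaluation(policy, P, n_states, gamma=0.95, tol=1e-8):
    # Iteratively evaluate V^pi for a deterministic policy (dict: state -> action).
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def constrained_policy_improvement(P, n_states, n_actions, permitted, gamma=0.95):
    # Policy iteration whose greedy step ranges only over permitted actions,
    # so every iterate (and the fixed point) satisfies the constraint.
    # Assumes every state has at least one permitted action.
    policy = {s: next(a for a in range(n_actions) if permitted(s, a))
              for s in range(n_states)}
    while True:
        V = policy_evaluation(policy, P, n_states, gamma)
        stable = True
        for s in range(n_states):
            # Greedy improvement restricted to the deontically permitted set.
            best = max((a for a in range(n_actions) if permitted(s, a)),
                       key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```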

📝 Abstract
Markov Decision Processes (MDPs) are the most common model for decision making under uncertainty in the Machine Learning community. An MDP captures non-determinism, probabilistic uncertainty, and an explicit model of action. A Reinforcement Learning (RL) agent learns to act in an MDP by maximizing a utility function. This paper considers the problem of learning a decision policy that maximizes utility subject to satisfying a constraint expressed in deontic logic. In this setup, the utility captures the agent's mission, such as going quickly from A to B. The deontic formula represents (ethical, social, situational) constraints on how the agent might achieve its mission by prohibiting classes of behaviors. We use the logic of Expected Act Utilitarianism, a probabilistic stit logic that can be interpreted over controlled MDPs. We develop a variation on policy improvement and show that it reaches a constrained local maximum of the mission utility. Given that in stit logic, an agent's duty is derived from value maximization, this can be seen as a way of acting to simultaneously maximize two value functions, one of which is implicit, in a bi-level structure. We illustrate these results with experiments on sample MDPs.
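
Because in stit logic duty itself arises from maximizing an implicit deontic value function, the bi-level structure in the abstract can be pictured as a lexicographic greedy step: keep the actions that maximize the deontic value, then choose among them by mission utility. The sketch below is an illustration under that reading, not the paper's algorithm; `q_duty`, `q_mission`, and the tolerance `eps` are hypothetical names.

```python
import numpy as np

def bilevel_greedy_action(q_duty, q_mission, eps=1e-6):
    # Pick a mission-optimal action among the deontically optimal ones.
    # q_duty, q_mission: 1-D arrays of action values for the current state.
    q_duty = np.asarray(q_duty, dtype=float)
    # Inner level: the set of actions fulfilling the obligation, i.e. those
    # (near-)maximizing the implicit deontic value function.
    duty_set = np.flatnonzero(q_duty >= q_duty.max() - eps)
    # Outer level: among obligatory actions, maximize the mission utility.
    return duty_set[np.argmax(np.asarray(q_mission, dtype=float)[duty_set])]

# Example: actions 0 and 2 are both deontically optimal;
# the mission utility breaks the tie in favor of action 2.
print(bilevel_greedy_action([1.0, 0.2, 1.0], [0.3, 0.9, 0.7]))  # -> 2
```
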
Problem

Research questions and friction points this paper is trying to address.

Maximizing utility while satisfying deontic logic constraints
Learning decision policies under ethical and situational constraints
Achieving bi-level value maximization in controlled MDPs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deontic logic constraints in RL policies
Bi-level value function maximization approach
Policy improvement for constrained MDPs
Alena Makarova
School of Electrical Engineering and Computer Science, Oregon State University
Houssam Abbas
Oregon State University
Cyber-physical systems · AI Ethics · Autonomous Systems · Formal methods · Control theory