Counterfactually Safe Reinforcement Learning

📅 2026-05-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Reinforcement learning policies that are optimal in population-level average performance may still inflict unacceptable harm on particular individuals. To address this, the work formalizes individual harm from a counterfactual perspective, as the event in which a chosen action yields a strictly worse outcome than a baseline alternative, and proposes a two-stage reinforcement learning framework that maximizes expected return while controlling the risk of individual harm. Theoretical analysis establishes finite-sample properties of the learned policy, gives an upper bound on its suboptimality gap, and shows that the harm rate remains controlled. Experiments on both simulated and real-world datasets indicate that the framework can balance overall utility against individual safety.
📝 Abstract
Reinforcement learning algorithms are generally designed to maximize the expected return across a population. However, a policy that is optimal on average may be suboptimal for certain individuals, leading to potential safety concerns. To address this, we first formalize the notion of individual harm from a counterfactual perspective and define harm as the event in which a chosen action results in a strictly worse outcome than a baseline alternative. We then propose a general two-stage procedure for learning policies that maximize the expected return while accounting for individual harm. We further establish the finite-sample properties of the learned policy, derive an upper bound on its sub-optimality gap, and show that the harm rate remains well-controlled. Numerical experiments on both simulated and real-world datasets demonstrate the effectiveness of the proposed approach.
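The harm definition in the abstract can be illustrated with a minimal Python sketch. It is not the paper's two-stage procedure: the function names `harm_rate` and `simulate`, the Gaussian outcome model, and the effect size are all hypothetical choices made for illustration. The sketch also assumes both potential outcomes are known for every individual, which is only possible in simulation; with real data the counterfactual outcome is never observed directly.

```python
import random


def harm_rate(chosen_outcomes, baseline_outcomes):
    """Fraction of individuals whose outcome under the chosen action is
    strictly worse than their outcome under the baseline action."""
    if len(chosen_outcomes) != len(baseline_outcomes):
        raise ValueError("outcome lists must have the same length")
    if not chosen_outcomes:
        raise ValueError("outcome lists must be non-empty")
    harmed = sum(1 for y, y0 in zip(chosen_outcomes, baseline_outcomes) if y < y0)
    return harmed / len(chosen_outcomes)


def simulate(n=10_000, seed=0):
    """Draw both potential outcomes for n individuals (hypothetical model)."""
    rng = random.Random(seed)
    treated, baseline = [], []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)      # outcome under the baseline action
        effect = rng.gauss(0.5, 1.0)  # heterogeneous individual effect (assumed)
        baseline.append(y0)
        treated.append(y0 + effect)   # outcome under the chosen action
    return treated, baseline


treated, baseline = simulate()
mean_gain = sum(t - b for t, b in zip(treated, baseline)) / len(treated)
rate = harm_rate(treated, baseline)
print(f"average gain over baseline: {mean_gain:.3f}")
print(f"individual harm rate:       {rate:.3f}")
```

In this toy model the chosen action improves the average outcome, yet a sizable share of individuals end up strictly worse off than under the baseline. That gap between average benefit and individual harm is what the abstract describes.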
Problem

Research questions and friction points this paper is trying to address.

counterfactual safety
individual harm
reinforcement learning
safe policy
harm prevention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Safety
Individual Harm
Reinforcement Learning
Two-stage Policy Learning
Harm Rate Control