Counterfactually Safe Reinforcement Learning

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reinforcement learning policies that are optimal in terms of population-level average return may still inflict unacceptable harm on particular individuals. To address this, the work formalizes individual harm from a counterfactual perspective, defining it as the event in which the chosen action yields a strictly worse outcome than a baseline alternative, and proposes a two-stage procedure for learning policies that maximize expected return while accounting for that harm. The theoretical analysis establishes finite-sample properties of the learned policy, derives an upper bound on its sub-optimality gap, and shows that the harm rate remains well controlled. Experiments on simulated and real-world datasets demonstrate the effectiveness of the approach.

📝 Abstract

Reinforcement learning algorithms are generally designed to maximize the expected return across a population. However, a policy that is optimal on average may be suboptimal for certain individuals, leading to potential safety concerns. To address this, we first formalize the notion of individual harm from a counterfactual perspective and define harm as the event in which a chosen action results in a strictly worse outcome than a baseline alternative. We then propose a general two-stage procedure for learning policies that maximize the expected return while accounting for individual harm. We further establish the finite-sample properties of the learned policy, derive an upper bound on its sub-optimality gap, and show that the harm rate remains well-controlled. Numerical experiments on both simulated and real-world datasets demonstrate the effectiveness of the proposed approach.
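To make the harm definition above concrete, here is a minimal sketch in Python. It is a toy illustration, not the paper's two-stage procedure, whose details the abstract does not give. It assumes a single-step setting with two actions (action 0 as the baseline), a simulated population in which both potential outcomes are known, and a hypothetical family of threshold policies. In real data only the outcome of the action actually taken is observed, so harm cannot be computed this directly. All names (`is_harmed`, `evaluate`, `select_threshold`, `simulate`) and all numeric settings are assumptions made for illustration.

```python
import random


def is_harmed(y_chosen, y_baseline):
    """Harm event: the chosen action's outcome is strictly worse than the baseline's."""
    return y_chosen < y_baseline


def evaluate(policy, population):
    """Return (mean outcome, harm rate) of a policy over (x, y0, y1) triples.

    y0 is the outcome under the baseline action 0, y1 under action 1.
    """
    total, harmed = 0.0, 0
    for x, y0, y1 in population:
        y = y1 if policy(x) == 1 else y0
        total += y
        harmed += is_harmed(y, y0)
    n = len(population)
    return total / n, harmed / n


def select_threshold(population, thresholds, alpha):
    """Among policies a = 1[x > t], pick the highest-value one with harm rate <= alpha."""
    best = None
    for t in thresholds:
        value, harm = evaluate(lambda x, t=t: int(x > t), population)
        if harm <= alpha and (best is None or value > best[1]):
            best = (t, value, harm)
    return best  # (threshold, mean outcome, harm rate), or None if nothing is feasible


def simulate(n, seed=0):
    """Toy population: action 1 helps on average, but hurts some individuals with low x."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y0 = rng.gauss(0, 1)
        y1 = y0 + 0.5 + x + rng.gauss(0, 0.5)
        population.append((x, y0, y1))
    return population


population = simulate(5000)
always_treat = evaluate(lambda x: 1, population)  # best on average may still harm many
never_treat = evaluate(lambda x: 0, population)   # baseline policy: harm rate is zero
thresholds = [-2 + 0.2 * k for k in range(26)]    # candidate thresholds from -2.0 to 3.0
chosen = select_threshold(population, thresholds, alpha=0.05)
print("always treat:", always_treat)
print("never treat: ", never_treat)
print("constrained: ", chosen)
```

The sketch separates the two quantities the abstract trades off: `evaluate` reports the population-level mean outcome alongside the fraction of individuals who end up strictly worse off than under the baseline, and `select_threshold` maximizes the former subject to a cap `alpha` on the latter.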

Problem

Research questions and friction points this paper is trying to address.

counterfactual safety

individual harm

reinforcement learning

safe policy

harm prevention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Safety

Individual Harm

Reinforcement Learning