AI Summary
While reinforcement learning policies are often optimal in terms of population-level average performance, they may inflict unacceptable harm on individual agents. To address this issue, this work formally introduces a notion of individual harm from a counterfactual perspective and proposes a two-stage reinforcement learning framework that simultaneously maximizes expected return and controls the risk of individual harm. Theoretical analysis demonstrates that the proposed method guarantees a controllable harm rate under finite-sample settings and establishes a tight upper bound on policy suboptimality. Empirical evaluations on both synthetic and real-world datasets confirm the framework's ability to effectively balance overall utility against individual safety.
Abstract
Reinforcement learning algorithms are generally designed to maximize the expected return across a population. However, a policy that is optimal on average may be suboptimal for certain individuals, leading to potential safety concerns. To address this, we first formalize the notion of individual harm from a counterfactual perspective and define harm as the event in which a chosen action results in a strictly worse outcome than a baseline alternative. We then propose a general two-stage procedure for learning policies that maximize the expected return while accounting for individual harm. We further establish the finite-sample properties of the learned policy, derive an upper bound on its sub-optimality gap, and show that the harm rate remains well-controlled. Numerical experiments on both simulated and real-world datasets demonstrate the effectiveness of the proposed approach.