Policy Learning with Abstention

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses safety risks in high-stakes personalized decision-making that arise because policy learning models are typically forced to produce a decision regardless of predictive uncertainty. To mitigate this, the authors propose a *policy learning framework with abstention*, which lets the model deliberately abstain, deferring to a safe default policy or expert intervention, when predictive uncertainty is high; an abstaining policy is compensated with a small additive reward on top of the value of a random guess. Methodologically, they design a two-stage learner: the first stage identifies a set of approximately optimal policies, and the second constructs an abstention rule from the disagreement among them. A doubly robust objective handles unknown propensity scores, and the framework extends to margin conditions, distributionally robust optimization, and safe policy improvement. Theoretically, the paper establishes fast O(1/n)-type regret bounds under both known and unknown propensity scores, enhancing the safety, robustness, and practicality of policy learning in critical applications.
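
To make the two-stage idea concrete, here is a minimal Python sketch (not the authors' code): stage one fits several plug-in policies on bootstrap resamples as an illustrative stand-in for the paper's near-optimal policy set, and stage two abstains wherever those policies disagree. All function names and the bootstrap construction are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_policy_set(X, a, r, arms, n_policies=10, seed=0):
    """Stage 1 (sketch): fit plug-in policies on bootstrap resamples.

    Each "policy" is a dict of per-arm reward models; acting greedily on
    them approximates a near-optimal policy. Bootstrapping stands in for
    the paper's construction of an approximately optimal policy set.
    """
    rng = np.random.default_rng(seed)
    policy_set = []
    for _ in range(n_policies):
        idx = rng.integers(0, len(X), size=len(X))
        models = {}
        for k in arms:
            rows = idx[a[idx] == k]
            if rows.size == 0:                 # rare arm missing from resample
                rows = np.flatnonzero(a == k)  # fall back to the full data
            models[k] = Ridge().fit(X[rows], r[rows])
        policy_set.append(models)
    return policy_set

def act_or_abstain(policy_set, X, arms, abstain=-1):
    """Stage 2 (sketch): act where all policies agree, abstain otherwise."""
    votes = []
    for models in policy_set:
        preds = np.column_stack([models[k].predict(X) for k in arms])
        votes.append(np.asarray(arms)[preds.argmax(axis=1)])
    votes = np.stack(votes)                    # shape (n_policies, n)
    agree = (votes == votes[0]).all(axis=0)
    return np.where(agree, votes[0], abstain)
```

With binary arms `[0, 1]`, `act_or_abstain` returns `-1` exactly where the resampled policies split, which is where deferral to the safe default or an expert would kick in.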

📝 Abstract
Policy learning algorithms are widely used in areas such as personalized medicine and advertising to develop individualized treatment regimes. However, most methods force a decision even when predictions are uncertain, which is risky in high-stakes settings. We study policy learning with abstention, where a policy may defer to a safe default or an expert. When a policy abstains, it receives a small additive reward on top of the value of a random guess. We propose a two-stage learner that first identifies a set of near-optimal policies and then constructs an abstention rule from their disagreements. We establish fast O(1/n)-type regret guarantees when propensities are known, and extend these guarantees to the unknown-propensity case via a doubly robust (DR) objective. We further show that abstention is a versatile tool with direct applications to other core problems in policy learning: it yields improved guarantees under margin conditions without the common realizability assumption, connects to distributionally robust policy learning by hedging against small data shifts, and supports safe policy improvement by ensuring improvement over a baseline policy with high probability.
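
Since the unknown-propensity guarantee runs through a doubly robust objective, the sketch below shows the standard DR off-policy value estimate such an objective is built on. This is the textbook estimator, not necessarily the paper's exact objective, and all names are illustrative.

```python
import numpy as np

def dr_policy_value(pi_actions, a, r, e_hat, m_hat):
    """Standard doubly robust estimate of a policy's value.

    pi_actions : (n,) actions the candidate policy would take
    a, r       : (n,) logged actions and observed rewards
    e_hat      : (n,) estimated propensity of the logged action given x_i
    m_hat      : (n, n_arms) estimated mean reward of each arm given x_i

    The estimate stays consistent if either the propensity model or the
    outcome model is well specified, which is the point of using DR when
    true propensities are unknown.
    """
    rows = np.arange(len(r))
    direct = m_hat[rows, pi_actions]                        # outcome-model term
    ipw = (a == pi_actions) / e_hat * (r - m_hat[rows, a])  # propensity correction
    return float(np.mean(direct + ipw))
```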
Problem

Research questions and friction points this paper is trying to address.

How can a policy defer to a safe default or an expert when its predictions are too uncertain to act on?
What regret guarantees are achievable with known and with unknown propensity scores?
Can abstention improve robustness and safety in related policy learning problems (margin conditions, distribution shift, safe improvement)?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage learner identifies near-optimal policies
Constructs abstention rules from policy disagreements
Uses doubly robust objective for unknown propensities
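
Rounding out the safe policy improvement application highlighted in the abstract, here is a minimal check that certifies improvement over a baseline with high probability, using a Hoeffding lower confidence bound on per-sample (e.g. doubly robust) score differences. The paper's guarantee is analogous in spirit, not this exact bound, and `width` is an assumed range parameter.

```python
import numpy as np

def improves_with_high_prob(scores_pi, scores_base, delta=0.05, width=1.0):
    """Return True if the candidate beats the baseline with prob. >= 1 - delta.

    scores_pi, scores_base : per-sample value scores (e.g. DR scores)
    width : assumed width of the range of each per-sample difference
            (an assumption; the bound is only valid if it holds)

    Hoeffding: P(mean - E[diff] <= -t) <= exp(-2 n t^2 / width^2), so the
    lower confidence bound below holds with probability at least 1 - delta.
    """
    diff = np.asarray(scores_pi) - np.asarray(scores_base)
    n = diff.size
    t = width * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return diff.mean() - t > 0.0
```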