Offline Policy Learning with Weight Clipping and Heaviside Composite Optimization

📅 2026-01-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high variance of reweighted estimators caused by excessively small propensity scores in offline policy learning. The authors propose a weight-clipping algorithm that adaptively selects the clipping threshold by minimizing the mean squared error (MSE) of the policy value estimator. The resulting bilevel, discontinuous optimization problem is reformulated as a Heaviside composite optimization problem and solved efficiently via progressive integer programming. Theoretical analysis establishes an upper bound on policy suboptimality, and empirical results show that the method substantially reduces estimation variance and improves the performance of learned policies. Overall, the study offers a rigorous and computationally tractable optimization framework for offline policy learning with discontinuous objectives.
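The core estimator the summary describes can be sketched in a few lines. The snippet below is a minimal illustration of a weight-clipped inverse-propensity-weighted (IPW) policy value estimate, not the paper's exact formulation; the function name `clipped_ipw_value` and its argument names are assumptions made for this example.

```python
import numpy as np

def clipped_ipw_value(rewards, actions, policy_actions, propensities, tau):
    """Weight-clipped IPW estimate of a policy's value.

    Propensity scores below the clipping threshold `tau` are raised to
    `tau`, trading a small bias for a large reduction in variance when
    some logged actions were taken with very small probability.
    """
    clipped = np.maximum(propensities, tau)            # truncate small propensity scores
    match = (actions == policy_actions).astype(float)  # logged action agrees with the policy?
    weights = match / clipped                          # clipped importance weights
    return float(np.mean(weights * rewards))
```

With well-behaved propensities the clipping is inactive and the estimate coincides with plain IPW; clipping only bites on the small-propensity samples that would otherwise dominate the variance.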

📝 Abstract
Offline policy learning aims to use historical data to learn an optimal personalized decision rule. In the standard estimate-then-optimize framework, reweighting-based methods (e.g., inverse propensity weighting or doubly robust estimators) are widely used to produce unbiased estimates of policy values. However, when the propensity scores of some treatments are small, these reweighting-based methods suffer from high variance in policy value estimation, which may mislead the downstream policy optimization and yield a learned policy with inferior value. In this paper, we systematically develop an offline policy learning algorithm based on a weight-clipping estimator that truncates small propensity scores via a clipping threshold chosen to minimize the mean squared error (MSE) in policy value estimation. Focusing on linear policies, we address the bilevel and discontinuous objective induced by weight-clipping-based policy optimization by reformulating the problem as a Heaviside composite optimization problem, which provides a rigorous computational framework. The reformulated policy optimization problem is then solved efficiently using the progressive integer programming method, making practical policy learning tractable. We establish an upper bound for the suboptimality of the proposed algorithm, which reveals how the reduction in MSE of policy value estimation, enabled by our proposed weight-clipping estimator, leads to improved policy learning performance.
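To make the threshold-selection idea concrete, here is a minimal estimate-then-optimize sketch: a grid search for the clipping threshold that minimizes an estimated MSE of the clipped IPW value, using the gap between the clipped and unclipped estimates as a bias proxy. This heuristic decomposition is an assumption for illustration only; the paper instead solves the bilevel, discontinuous problem jointly via its Heaviside composite reformulation and progressive integer programming, and the function name `select_clipping_threshold` is invented for this example.

```python
import numpy as np

def select_clipping_threshold(rewards, actions, policy_actions, propensities,
                              taus=np.linspace(0.01, 0.5, 50)):
    """Grid search for the threshold minimizing an estimated MSE:
    (bias proxy)^2 + sample variance of the estimator's mean."""
    match = (actions == policy_actions).astype(float)
    unclipped_terms = match * rewards / propensities
    v_unclipped = unclipped_terms.mean()               # reference (unbiased) estimate
    best_tau, best_mse = None, np.inf
    for tau in taus:
        terms = match * rewards / np.maximum(propensities, tau)
        bias = terms.mean() - v_unclipped              # clipping-induced bias proxy
        var = terms.var(ddof=1) / len(terms)           # variance of the sample mean
        mse = bias ** 2 + var
        if mse < best_mse:
            best_tau, best_mse = float(tau), float(mse)
    return best_tau
```

The trade-off this search navigates is the one the abstract describes: a larger threshold shrinks the explosive variance terms from small propensities but injects bias, so the MSE-minimizing threshold sits between the two extremes.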
Problem

Research questions and friction points this paper is trying to address.

offline policy learning
propensity score
high variance
policy value estimation
weight clipping
Innovation

Methods, ideas, or system contributions that make the work stand out.

weight clipping
Heaviside composite optimization
offline policy learning
propensity score
mean squared error