🤖 AI Summary
This paper addresses the failure of decision-making policies under concept drift, i.e., shifts in the conditional distribution of the outcome given the covariates. To avoid the excessive conservatism of joint-distribution modeling, we propose the first policy learning framework explicitly designed for robustness against conditional distribution shifts. Methodologically, we introduce a doubly robust estimator of the worst-case reward and characterize the complexity of the policy class via its entropy integral under the Hamming distance; building on these, we formulate a distributionally robust optimization problem over perturbations of the conditional distribution. Theoretically, we establish, for the first time, the optimal suboptimality-gap convergence rate of $O(\kappa(\Pi)\,n^{-1/2})$ and prove the asymptotic normality of our estimator. Empirical evaluations demonstrate that our approach significantly outperforms existing baselines across diverse concept drift scenarios.
📝 Abstract
Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, yet most existing methods for robust policy learning consider the worst-case joint distribution of the covariate and the outcome. This joint-modeling strategy can be unnecessarily conservative when we have more information on the source of the distributional shift. This paper studies a more nuanced problem -- robust policy learning under concept drift, where only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated at a slower-than-root-$n$ rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class $\Pi$, and show that the sub-optimality gap of the proposed algorithm is of order $\kappa(\Pi)n^{-1/2}$, where $\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance and $n$ is the sample size. A matching lower bound shows the optimality of this rate. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement over existing benchmarks.
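To make the doubly robust evaluation idea concrete, below is a minimal sketch of the *standard* (non-robust) doubly robust / AIPW estimator of a policy's average reward, which is the building block the paper's worst-case estimator generalizes. The paper replaces the plain sample average with a worst-case over perturbed conditional distributions of the outcome given the covariate; that step is not shown here. The data-generating process, the policy `pi`, and the nuisance functions are all hypothetical, and the true nuisances are plugged in only for illustration (in practice they would be estimated, e.g., by cross-fitting).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical observational data: covariate X, binary action A, outcome Y.
X = rng.normal(size=n)
e1 = 1.0 / (1.0 + np.exp(-X))          # propensity P(A=1 | X), assumed known here
A = (rng.random(n) < e1).astype(int)
mu = lambda x, a: x * a                # outcome model E[Y | X=x, A=a] (hypothetical)
Y = mu(X, A) + rng.normal(scale=0.5, size=n)

pi = lambda x: (x > 0).astype(int)     # candidate policy to evaluate


def dr_value(X, A, Y, pi, mu_hat, e1_hat):
    """Doubly robust (AIPW) estimate of the policy value E[Y(pi(X))].

    Consistent if either the outcome model mu_hat or the propensity
    e1_hat is correct; here both are the true functions.
    """
    a_pi = pi(X)
    # Probability of the action actually observed, P(A = A_i | X_i).
    prop_obs = np.where(A == 1, e1_hat, 1.0 - e1_hat)
    direct = mu_hat(X, a_pi)                                  # plug-in term
    correction = (A == a_pi) / prop_obs * (Y - mu_hat(X, A))  # IPW residual term
    return np.mean(direct + correction)


v = dr_value(X, A, Y, pi, mu, e1)
# True value of pi here is E[X * 1{X > 0}] = 1/sqrt(2*pi) ~ 0.399.
```

The paper's estimator keeps this double robustness (hence tolerance to slowly estimated nuisances) while targeting the worst-case reward over a perturbation set of conditional distributions instead of the single observed one.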