🤖 AI Summary
This paper studies α-quantile maximization for offline policy learning under unobserved confounding. The problem faces three fundamental challenges: the nonsmoothness of the quantile objective, failure of causal identification due to confounding bias, and insufficient coverage of offline data. To address them, we propose the first sample-efficient algorithm that integrates instrumental variable (IV) and negative control approaches for nonparametric identification, coupled with nonlinear integral equation solving and a pessimistic estimation framework for robust optimization. Theoretically, under mild coverage assumptions, the learned policy achieves an $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate to the optimal quantile value, matching the minimax lower bound up to logarithmic factors, while ensuring strong statistical guarantees and computational tractability.
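For concreteness, the objective can be stated using the standard definition of a quantile (the notation below is illustrative and not taken from the paper):

$$
Q_\alpha(\pi) \;=\; \inf\bigl\{\, q \in \mathbb{R} \;:\; \mathbb{P}\bigl(R(\pi) \le q\bigr) \ge \alpha \,\bigr\},
\qquad
\pi^{\star} \in \operatorname*{arg\,max}_{\pi \in \Pi} \, Q_\alpha(\pi),
$$

where $R(\pi)$ denotes the reward obtained under policy $\pi$ and $\Pi$ is the policy class; the guarantee above says the learned policy's quantile value comes within $\tilde{\mathcal{O}}(n^{-1/2})$ of $Q_\alpha(\pi^{\star})$.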
📝 Abstract
We study quantile-optimal policy learning, where the goal is to find a policy whose reward distribution has the largest $\alpha$-quantile for some $\alpha \in (0, 1)$. We focus on the offline setting, in which the data-generating process involves unobserved confounders. Such a problem suffers from three main challenges: (i) nonlinearity of the quantile objective as a functional of the reward distribution, (ii) the unobserved confounding issue, and (iii) insufficient coverage of the offline dataset. To address these challenges, we propose a suite of causal-assisted policy learning methods that provably enjoy strong theoretical guarantees under mild conditions. In particular, to address (i) and (ii), using causal inference tools such as instrumental variables and negative controls, we propose to estimate the quantile objectives by solving nonlinear functional integral equations. We then adopt a minimax estimation approach with nonparametric models to solve these integral equations, and construct conservative policy value estimates that address (iii). The final policy is the one that maximizes these pessimistic estimates. In addition, we propose a novel regularized policy learning method that is more amenable to computation. Finally, we prove that the policies learned by these methods are $\tilde{\mathcal{O}}(n^{-1/2})$ quantile-optimal under a mild coverage assumption on the offline dataset. Here, $\tilde{\mathcal{O}}(\cdot)$ omits poly-logarithmic factors. To the best of our knowledge, these are the first sample-efficient policy learning algorithms for estimating the quantile-optimal policy in the presence of unmeasured confounding.
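The following is a minimal sketch of the pessimism-based selection step described above: each candidate policy receives a conservative estimate of its $\alpha$-quantile value, and the learned policy is the maximizer of that estimate. The quantile estimates here are plain empirical quantiles of synthetic rewards; in the paper they would instead come from solving the nonlinear functional integral equations identified via instrumental variables or negative controls, which this sketch does not implement. All names, the penalty form, and the data are illustrative assumptions, not the authors' code.

```python
# Hypothetical illustration of pessimistic quantile-policy selection (not the paper's method).
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.25           # target quantile level, alpha in (0, 1)
n = 2_000              # offline sample size per candidate policy (illustrative)

# Pretend these are reward draws attributable to three candidate policies after an
# identification step has removed confounding bias (purely synthetic here).
candidate_rewards = {
    "policy_A": rng.normal(loc=1.0, scale=1.0, size=n),
    "policy_B": rng.normal(loc=0.8, scale=0.3, size=n),   # lower mean, much lower spread
    "policy_C": rng.normal(loc=1.2, scale=2.0, size=n),
}

def pessimistic_quantile_value(rewards, alpha, c=1.0):
    """Empirical alpha-quantile minus an O(n^{-1/2}) uncertainty penalty.

    The penalty width c / sqrt(n) is a stand-in for the data-dependent confidence
    bound a real method would derive from its estimation error analysis.
    """
    q_hat = np.quantile(rewards, alpha)
    penalty = c / np.sqrt(len(rewards))
    return q_hat - penalty

# Score every candidate policy conservatively and pick the maximizer.
scores = {name: pessimistic_quantile_value(r, alpha) for name, r in candidate_rewards.items()}
best = max(scores, key=scores.get)

for name, s in scores.items():
    print(f"{name}: pessimistic {alpha}-quantile value = {s:.3f}")
print("selected policy:", best)
```

In this toy run, the low-variance policy can win at small $\alpha$ even though its mean reward is lower, which is exactly the kind of behavior quantile (rather than mean) maximization is meant to capture.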