Quantile-Optimal Policy Learning under Unmeasured Confounding

📅 2025-06-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies α-quantile maximization for offline policy learning under unobserved confounding. The problem faces three fundamental challenges: the nonsmoothness of the quantile objective, failure of causal identification due to confounding bias, and insufficient coverage of offline data. To address them, we propose the first sample-efficient algorithm that innovatively integrates instrumental variable (IV) and negative control approaches for nonparametric identification, coupled with nonlinear integral equation solving and a pessimistic estimation framework for robust optimization. Theoretically, under mild coverage assumptions, the learned policy achieves an $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate to the optimal quantile value, matching the minimax lower bound up to logarithmic factors, while ensuring strong statistical guarantees and computational tractability.

📝 Abstract
We study quantile-optimal policy learning, where the goal is to find a policy whose reward distribution has the largest $\alpha$-quantile for some $\alpha \in (0, 1)$. We focus on the offline setting whose generating process involves unobserved confounders. Such a problem suffers from three main challenges: (i) nonlinearity of the quantile objective as a functional of the reward distribution, (ii) unobserved confounding, and (iii) insufficient coverage of the offline dataset. To address these challenges, we propose a suite of causal-assisted policy learning methods that provably enjoy strong theoretical guarantees under mild conditions. In particular, to address (i) and (ii), using causal inference tools such as instrumental variables and negative controls, we propose to estimate the quantile objectives by solving nonlinear functional integral equations. We then adopt a minimax estimation approach with nonparametric models to solve these integral equations, and construct conservative policy estimates that address (iii). The final policy is the one that maximizes these pessimistic estimates. In addition, we propose a novel regularized policy learning method that is more amenable to computation. Finally, we prove that the policies learned by these methods are $\tilde{\mathscr{O}}(n^{-1/2})$ quantile-optimal under a mild coverage assumption on the offline dataset. Here, $\tilde{\mathscr{O}}(\cdot)$ omits poly-logarithmic factors. To the best of our knowledge, we propose the first sample-efficient policy learning algorithms for estimating the quantile-optimal policy in the presence of unmeasured confounding.
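To build intuition for the quantile objective and the pessimism principle described above, the sketch below compares two hypothetical candidate policies by a bootstrap lower confidence bound on the $\alpha$-quantile of their rewards. This is a toy illustration only: the policy names, reward distributions, and the bootstrap bound are all invented for the example, and the paper's actual estimator works through causal identification (IV and negative controls), not raw reward samples.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.25   # target quantile level
n_boot = 200   # bootstrap replicates for the pessimistic bound

# Hypothetical reward draws for two candidate policies. Policy A has a
# higher mean but heavier lower tail; policy B is tighter around its mean.
rewards = {
    "policy_a": rng.normal(loc=1.0, scale=0.5, size=500),
    "policy_b": rng.normal(loc=0.8, scale=0.1, size=500),
}

def pessimistic_quantile(x, alpha, n_boot, rng):
    """Bootstrap lower confidence bound on the alpha-quantile of x."""
    boots = [np.quantile(rng.choice(x, size=len(x)), alpha)
             for _ in range(n_boot)]
    return np.quantile(boots, 0.05)  # 5th percentile across replicates

# Score each policy by its pessimistic quantile estimate and pick the best.
scores = {name: pessimistic_quantile(x, alpha, n_boot, rng)
          for name, x in rewards.items()}
best = max(scores, key=scores.get)
```

Note that the quantile criterion can prefer the policy with the lower mean: the tighter distribution of `policy_b` gives it the larger 0.25-quantile, which is exactly the kind of tail-risk-aware choice that mean-optimal policy learning misses.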
Problem

Research questions and friction points this paper is trying to address.

Learning optimal policies with unmeasured confounding in rewards
Estimating quantile objectives using causal inference tools
Addressing insufficient coverage in offline datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses instrumental variables for unmeasured confounding
Solves nonlinear functional integral equations
Adopts minimax estimation with nonparametric models
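The "solves nonlinear equations" idea can be made concrete with a much simpler stand-in: the $\alpha$-quantile is the root $q$ of $F(q) - \alpha = 0$, and once an estimate of the CDF is available the quantile can be recovered by one-dimensional root finding. The sketch below does this by bisection on an empirical CDF; it is a toy analogue, not the paper's minimax estimator for the functional integral equations.

```python
import numpy as np

def quantile_by_rootfinding(samples, alpha, tol=1e-6):
    """Find q with F_hat(q) ~= alpha by bisection on the empirical CDF.

    Toy analogue of the paper's approach: the nonlinear "equation" here
    is simply F_hat(q) - alpha = 0 for the empirical CDF F_hat.
    """
    lo, hi = samples.min(), samples.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.mean(samples <= mid) < alpha:
            lo = mid   # CDF too small: quantile lies to the right
        else:
            hi = mid   # CDF at least alpha: quantile lies to the left
    return 0.5 * (lo + hi)

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
q = quantile_by_rootfinding(x, 0.5)  # should land near the sample median
```

In the paper, the unknown is not a scalar CDF but a nonparametric solution to an integral equation identified via instrumental variables and negative controls, and the minimax formulation plays the role that the empirical CDF plays here.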
Zhongren Chen
Yale University, Department of Statistics and Data Science
Causal Inference · Large Language Model
Siyu Chen
Department of Statistics and Data Science, Yale University
Zhengling Qi
Department of Decision Sciences, George Washington University
Xiaohong Chen
Cowles Foundation for Research in Economics, Yale University
Zhuoran Yang
Yale University
machine learning · optimization · reinforcement learning · statistics