🤖 AI Summary
This paper studies α-quantile maximization for offline policy learning under unobserved confounding. The problem faces three fundamental challenges: the nonsmoothness of the quantile objective, failure of causal identification due to confounding bias, and insufficient coverage of offline data. To address them, we propose the first sample-efficient algorithm that integrates instrumental variable (IV) and negative control approaches for nonparametric identification, coupled with nonlinear integral equation solving and a pessimistic estimation framework for robust optimization. Theoretically, under mild coverage assumptions, the learned policy achieves an $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate to the optimal quantile value, matching the minimax lower bound up to logarithmic factors, while ensuring strong statistical guarantees and computational tractability.
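For concreteness, the objective can be stated using the standard definition of a quantile (the notation below is illustrative and not taken from the paper):

$$
Q_\alpha(\pi) \;=\; \inf\bigl\{\, q \in \mathbb{R} \;:\; \mathbb{P}\bigl(R(\pi) \le q\bigr) \ge \alpha \,\bigr\},
\qquad
\pi^{\star} \in \operatorname*{arg\,max}_{\pi \in \Pi} \, Q_\alpha(\pi),
$$

where $R(\pi)$ denotes the reward obtained under policy $\pi$ and $\Pi$ is the policy class; the guarantee above says the learned policy's quantile value comes within $\tilde{\mathcal{O}}(n^{-1/2})$ of $Q_\alpha(\pi^{\star})$.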
📝 Abstract
We study quantile-optimal policy learning, where the goal is to find a policy whose reward distribution has the largest $\alpha$-quantile for some $\alpha \in (0, 1)$. We focus on the offline setting, in which the data-generating process involves unobserved confounders. Such a problem suffers from three main challenges: (i) nonlinearity of the quantile objective as a functional of the reward distribution, (ii) the unobserved confounding issue, and (iii) insufficient coverage of the offline dataset. To address these challenges, we propose a suite of causal-assisted policy learning methods that provably enjoy strong theoretical guarantees under mild conditions. In particular, to address (i) and (ii), using causal inference tools such as instrumental variables and negative controls, we propose to estimate the quantile objectives by solving nonlinear functional integral equations. We then adopt a minimax estimation approach with nonparametric models to solve these integral equations, and construct conservative policy value estimates that address (iii). The final policy is the one that maximizes these pessimistic estimates. In addition, we propose a novel regularized policy learning method that is more amenable to computation. Finally, we prove that the policies learned by these methods are $\tilde{\mathcal{O}}(n^{-1/2})$ quantile-optimal under a mild coverage assumption on the offline dataset. Here, $\tilde{\mathcal{O}}(\cdot)$ omits poly-logarithmic factors. To the best of our knowledge, these are the first sample-efficient policy learning algorithms for estimating the quantile-optimal policy in the presence of unmeasured confounding.
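The following is a minimal sketch of the pessimism-based selection step described above: each candidate policy receives a conservative estimate of its $\alpha$-quantile value, and the learned policy is the maximizer of that estimate. The quantile estimates here are plain empirical quantiles of synthetic rewards; in the paper they would instead come from solving the nonlinear functional integral equations identified via instrumental variables or negative controls, which this sketch does not implement. All names, the penalty form, and the data are illustrative assumptions, not the authors' code.

```python
# Hypothetical illustration of pessimistic quantile-policy selection (not the paper's method).
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.25           # target quantile level, alpha in (0, 1)
n = 2_000              # offline sample size per candidate policy (illustrative)

# Pretend these are reward draws attributable to three candidate policies after an
# identification step has removed confounding bias (purely synthetic here).
candidate_rewards = {
    "policy_A": rng.normal(loc=1.0, scale=1.0, size=n),
    "policy_B": rng.normal(loc=0.8, scale=0.3, size=n),   # lower mean, much lower spread
    "policy_C": rng.normal(loc=1.2, scale=2.0, size=n),
}

def pessimistic_quantile_value(rewards, alpha, c=1.0):
    """Empirical alpha-quantile minus an O(n^{-1/2}) uncertainty penalty.

    The penalty width c / sqrt(n) is a stand-in for the data-dependent confidence
    bound a real method would derive from its estimation error analysis.
    """
    q_hat = np.quantile(rewards, alpha)
    penalty = c / np.sqrt(len(rewards))
    return q_hat - penalty

# Score every candidate policy conservatively and pick the maximizer.
scores = {name: pessimistic_quantile_value(r, alpha) for name, r in candidate_rewards.items()}
best = max(scores, key=scores.get)

for name, s in scores.items():
    print(f"{name}: pessimistic {alpha}-quantile value = {s:.3f}")
print("selected policy:", best)
```

In this toy run, the low-variance policy can win at small $\alpha$ even though its mean reward is lower, which is exactly the kind of behavior quantile (rather than mean) maximization is meant to capture.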