Conformal-Style Quantile Analyses for Stochastic Bandits

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
Classical multi-armed bandit algorithms optimize mean reward and can therefore miss arms with strong upper-tail performance, such as high-quantile outcomes. The authors propose ACP-UCB1, presented as the first strategy to integrate adaptive conformal prediction into quantile multi-armed bandits. For a fixed miscoverage level, ACP-UCB1 combines an adaptive conformal estimate of each arm's upper quantile with a UCB-style optimism bonus. The analysis relies on concentration inequalities for reward-quantile pairs, a perturbation argument for recomputed score quantiles, and localization of the adaptive confidence level, and it yields a logarithmic upper-quantile regret bound with per-arm contribution \(O(\log n / \Delta_j^{\text{ACP}})\). Numerical experiments compare ACP-UCB1 with the standard UCB1 algorithm and are reported to support the improvement.
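The idea of pairing an upper-quantile estimate with an optimism bonus can be illustrated with a heavily simplified sketch. This is not ACP-UCB1: it replaces the adaptive conformal estimate with a plain empirical \((1-\alpha/2)\)-quantile and borrows the classical UCB1 bonus \(\sqrt{2\log t / n_j}\) as an assumption. The function name `quantile_ucb_sketch`, the two toy arms, and the horizon are hypothetical choices made only for illustration.

```python
import math
import random


def quantile_ucb_sketch(arms, horizon, alpha=0.1, seed=0):
    """Toy index policy (NOT ACP-UCB1): empirical upper quantile plus a UCB1-style bonus.

    `arms` is a list of callables that take a random.Random and return a reward.
    Returns the number of pulls per arm.
    """
    rng = random.Random(seed)
    rewards = [[] for _ in arms]

    # Initialization: pull every arm once.
    for j, draw in enumerate(arms):
        rewards[j].append(draw(rng))

    for t in range(len(arms) + 1, horizon + 1):
        def index(j):
            xs = sorted(rewards[j])
            n = len(xs)
            # Empirical (1 - alpha/2)-quantile via the order statistic of rank ceil((1 - alpha/2) n).
            k = min(n - 1, math.ceil((1 - alpha / 2) * n) - 1)
            # Assumed bonus: the classical UCB1 exploration term.
            return xs[k] + math.sqrt(2 * math.log(t) / n)

        best = max(range(len(arms)), key=index)
        rewards[best].append(arms[best](rng))

    return [len(r) for r in rewards]


# Hypothetical arms: arm 0 has the higher mean, arm 1 the higher upper quantile.
arms = [
    lambda rng: rng.uniform(0.4, 0.6),  # mean 0.50, 0.95-quantile 0.59
    lambda rng: rng.uniform(0.0, 0.9),  # mean 0.45, 0.95-quantile 0.855
]
counts = quantile_ucb_sketch(arms, horizon=2000)
print(counts)
```

In this toy setting the policy concentrates its pulls on arm 1, the arm with the larger upper quantile, even though arm 0 has the larger mean. The sketch carries none of the regret guarantees discussed in the summary.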
📝 Abstract
Stochastic bandit algorithms are usually analyzed under a mean-reward criterion, yet many problems favor arms with strong upper-tail performance, which we study herein. For a fixed miscoverage level \(\alpha\), the natural upper-tail target of arm \(j\) is the upper endpoint \(F_j^{-1}(1-\alpha/2)\) of a central prediction interval. This target can rank arms differently from their means, creating a central mismatch with the classical bandit objective. To this end, we propose ACP-UCB1, a conformal-style policy that combines an adaptive conformal estimate of the upper endpoint with a UCB-type optimism bonus. The technical challenge is that the conformity scores used by ACP-UCB1 are recomputed from evolving empirical quantile estimates and evaluated at an adaptive level. We control this endpoint through reward-quantile concentration, a perturbation argument for recomputed score quantiles, and deterministic localization of the adaptive level. ACP-UCB1 achieves logarithmic upper-quantile regret with per-arm contribution \(O(\log n / \Delta_j^{\mathrm{ACP}})\). We also provide metric-specific regret decompositions comparing ACP-UCB1 with UCB1 and use numerical experiments to validate performance and improvement.
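The mismatch between the mean criterion and the upper endpoint \(F_j^{-1}(1-\alpha/2)\) can be seen in a small worked example. The two Gaussian arms and the value \(\alpha = 0.1\) below are assumptions chosen for illustration and do not come from the paper.

```python
from statistics import NormalDist

alpha = 0.1  # assumed miscoverage level

# Hypothetical arms: A has the higher mean, B has the heavier upper tail.
arms = {
    "A": NormalDist(mu=1.0, sigma=0.1),
    "B": NormalDist(mu=0.8, sigma=0.5),
}

# Upper endpoint of the central (1 - alpha) prediction interval for each arm.
upper = {name: dist.inv_cdf(1 - alpha / 2) for name, dist in arms.items()}

best_by_mean = max(arms, key=lambda name: arms[name].mean)
best_by_upper = max(upper, key=upper.get)

print(best_by_mean, best_by_upper)
```

Arm A wins under the mean criterion, while arm B wins under the upper-endpoint criterion, since its 0.95-quantile is roughly 1.62 against roughly 1.16 for arm A. A mean-based policy such as UCB1 would therefore converge to an arm that is suboptimal for the upper-tail objective.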
Problem

Research questions and friction points this paper is trying to address.

stochastic bandits
upper-tail performance
quantile regret
conformal prediction
multi-armed bandits
Innovation

Methods, ideas, or system contributions that make the work stand out.

conformal prediction
quantile regret
stochastic bandits
upper-tail optimization
adaptive confidence level