🤖 AI Summary
This paper addresses the challenge in contextual bandits where rewards follow heavy-tailed distributions or have unknown ranges, causing standard algorithms to incur regret scaling polynomially with the reward upper bound $R$. The authors propose the first general function-approximation algorithm for this setting based on the Catoni robust estimator. Methodologically, the paper introduces the Catoni estimator into contextual bandits for the first time, integrating variance-weighted regression with a novel peeling mechanism that requires no prior variance knowledge, yielding a regret bound that depends only on the cumulative variance rather than the reward range. Theoretical contributions include: (1) an optimal regret bound of $O(\sqrt{\sum_t \sigma_t^2} + \log(RT))$ when the variances are known; (2) a variance-dominated bound under unknown variances, derived via a fourth-moment assumption; and (3) a matching lower bound establishing optimality. Empirical results demonstrate substantial improvements over existing methods in heavy-tailed regimes.
📝 Abstract
Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni's estimator from robust statistics, and apply it to contextual bandits with general function approximation. When the variance of the reward at each round is known, we use a variance-weighted regression approach and establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range $R$ as well as the number of rounds $T$. For the unknown-variance case, we further propose a careful peeling-based algorithm and remove the need for cumbersome variance estimation. With additional dependence on the fourth moment, our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence. Moreover, we demonstrate the optimality of the leading-order term in our regret bound through a matching lower bound.
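At the core of the approach is Catoni's M-estimator of the mean, which replaces the empirical average with the root of a soft-clipped score equation, so a few extreme rewards cannot dominate the estimate. Below is a minimal sketch of that estimator alone, not the paper's full bandit algorithm; the function names, the bisection solver, and the choice of `alpha` are illustrative assumptions:

```python
import math

def psi(x):
    # Catoni's influence function: grows only logarithmically,
    # so extreme samples have bounded influence on the score.
    # psi(x) = log(1 + x + x^2/2) for x >= 0, odd extension for x < 0.
    if x >= 0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(1.0 - x + 0.5 * x * x)

def catoni_mean(samples, alpha, tol=1e-9):
    """Catoni's M-estimator: the root theta of
        sum_i psi(alpha * (x_i - theta)) = 0,
    found by bisection. alpha > 0 trades off bias against
    robustness (in theory it is set from the variance and the
    confidence level)."""
    lo, hi = min(samples), max(samples)

    def score(theta):
        return sum(psi(alpha * (x - theta)) for x in samples)

    # score(theta) is decreasing in theta, with score(lo) >= 0
    # and score(hi) <= 0, so the root lies in [lo, hi].
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

For example, on 99 samples equal to 1.0 plus one outlier at 1000.0, the empirical mean is 10.99, while `catoni_mean(data, alpha=0.1)` stays close to 1, illustrating why regret can depend on the variance rather than the worst-case range $R$.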