Pessimistic Risk-Aware Policy Learning in Contextual Bandits

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses risk-aware offline policy learning in contextual bandits, where a decision rule must be learned from logged data and judged by a general risk criterion rather than by expected reward alone. It proposes a unified distributional framework for optimizing Lipschitz-continuous risk functionals, a class that includes mean-variance, entropic risk, and conditional value-at-risk. Using new empirical concentration inequalities for importance sampling-based distributional estimators, the analysis obtains data-dependent suboptimality bounds at an $\tilde{\mathcal{O}}(1/\sqrt{n})$ rate without assuming uniform overlap. The rate is minimax optimal and matches that of risk-neutral offline policy optimization, so optimizing general Lipschitz risk criteria carries no additional statistical cost.

📝 Abstract

We study risk-aware offline policy learning, aiming to learn a decision rule from logged data that is optimal under general risk criteria. This problem is crucial in high-stakes domains where online interaction is infeasible and adverse outcomes must be carefully controlled. However, existing literature on offline contextual bandits either centers on expected-reward criteria or restricts risk considerations to policy evaluation instead of optimization. In this work, we propose a unified distributional framework for optimizing Lipschitz-continuous risk functionals, a broad class of risk measures encompassing mean-variance, entropic risk, and conditional value-at-risk, among others. By developing novel empirical concentration inequalities for importance sampling-based distributional estimators, our analysis derives data-dependent suboptimality bounds with an $\tilde{\mathcal{O}}(1/\sqrt{n})$ rate, without relying on restrictive uniform overlap assumptions. This rate is minimax optimal and matches that of risk-neutral offline policy optimization, indicating that optimizing general Lipschitz risk criteria incurs no additional statistical cost relative to the expected-reward criterion.
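
To make the abstract's ingredients concrete, the following is a minimal Python sketch of an importance sampling-based estimate of the reward distribution under a target policy, followed by three of the risk functionals the abstract names. It is an illustration under stated assumptions, not the paper's estimator or algorithm: the function names (`sn_weights`, `weighted_cdf`, `cvar`, `entropic_risk`, `mean_variance`) are hypothetical, the weights are self-normalized as a convenience so that they form a valid distribution, CVaR is taken over the lower tail of rewards, and no pessimism penalty is included.

```python
import numpy as np

def sn_weights(pi_probs, mu_probs):
    """Self-normalized importance weights pi(a_i|x_i) / mu(a_i|x_i).

    Assumption: self-normalization is an illustrative choice, not
    necessarily the estimator analyzed in the paper.
    """
    w = np.asarray(pi_probs, dtype=float) / np.asarray(mu_probs, dtype=float)
    return w / w.sum()

def weighted_cdf(rewards, probs, t):
    """Estimated P(R <= t) under the target policy."""
    r = np.asarray(rewards, dtype=float)
    return float(np.sum(np.asarray(probs)[r <= t]))

def cvar(rewards, probs, alpha):
    """Lower-tail CVaR via the Rockafellar-Uryasev form
    max_q { q - E[(q - R)_+] / alpha }; for a discrete distribution
    the maximum is attained at one of the observed rewards."""
    r = np.asarray(rewards, dtype=float)
    p = np.asarray(probs, dtype=float)
    # shortfall[j, i] = (r_j - r_i)_+ for candidate threshold q = r_j
    shortfall = np.maximum(r[:, None] - r[None, :], 0.0)
    return float(np.max(r - shortfall @ p / alpha))

def entropic_risk(rewards, probs, beta):
    """Entropic risk of rewards: -(1/beta) * log E[exp(-beta * R)]."""
    r = np.asarray(rewards, dtype=float)
    p = np.asarray(probs, dtype=float)
    return float(-np.log(np.sum(p * np.exp(-beta * r))) / beta)

def mean_variance(rewards, probs, lam):
    """Mean-variance criterion: E[R] - lam * Var[R]."""
    r = np.asarray(rewards, dtype=float)
    p = np.asarray(probs, dtype=float)
    mean = np.sum(p * r)
    return float(mean - lam * np.sum(p * (r - mean) ** 2))

# Toy logged data (made up for illustration only).
rewards = np.array([0.0, 1.0, 2.0, 3.0])
probs = sn_weights(pi_probs=[0.7, 0.1, 0.1, 0.1], mu_probs=[0.25] * 4)
print(cvar(rewards, probs, alpha=0.5), entropic_risk(rewards, probs, beta=1.0))
```

Each functional is computed from the same weighted empirical distribution, which mirrors the distributional viewpoint in the abstract: estimate the reward distribution once, then apply whichever risk criterion is of interest.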

Problem

Research questions and friction points this paper is trying to address.

risk-aware policy learning

offline contextual bandits

risk criteria

distributional policy optimization

high-stakes decision making

Innovation

Methods, ideas, or system contributions that make the work stand out.

distributional policy learning

risk-aware optimization

importance sampling