🤖 AI Summary
Offline contextual bandits learn policies from historical data, but biases in that data can amplify inter-group reward disparities, compromising fairness. This paper proposes a group-sensitive fair offline policy optimization framework that explicitly models inter-group reward disparity, either as a hard constraint (disparity ≤ δ) or as a soft minimization objective, a formulation novel to offline bandit learning that comes with theoretical convergence guarantees. The method employs a doubly robust estimator to improve the accuracy of disparity estimation and integrates it into off-policy gradient optimization for efficient policy learning. Experiments on multiple synthetic and real-world datasets demonstrate that the approach significantly reduces inter-group reward disparity (an average reduction of 37%) while preserving near-optimal overall policy performance (e.g., expected reward), effectively balancing fairness and utility.
📝 Abstract
Offline contextual bandits allow policies to be learned from historical (offline) data without online interaction. However, offline policy optimization that maximizes overall expected reward can unintentionally amplify reward disparities across groups: some groups may benefit more than others from the learned policy, raising fairness concerns, especially when resources are limited. In this paper, we study a group-sensitive fairness constraint in offline contextual bandits that reduces group-wise reward disparities arising during policy learning. We address two common parity requirements: constraining the reward disparity within a user-defined threshold, and minimizing the reward disparity during policy optimization. We propose a constrained offline policy optimization framework that introduces group-wise reward disparity constraints into an off-policy gradient-based optimization procedure. To improve the estimation of group-wise reward disparity during training, we employ a doubly robust estimator, and we further provide a convergence guarantee for policy optimization. Empirical results on synthetic and real-world datasets demonstrate that our method effectively reduces reward disparities while maintaining competitive overall performance.
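To make the ingredients concrete, the sketch below illustrates the general recipe the abstract describes: doubly robust (DR) estimation of group-wise policy values from logged bandit data, combined with a soft disparity penalty on a gradient-based policy update. All names, data, and hyperparameters here are illustrative assumptions, not the paper's actual method; the penalty weight stands in for the soft-minimization variant, and the gradient is taken by finite differences purely for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy logged dataset (illustrative, not from the paper):
# two actions, two groups, and a reward model in which action 1 helps
# group 0 much more than group 1 -> built-in inter-group disparity.
n, n_actions = 4000, 2
group = rng.integers(0, 2, size=n)              # sensitive group label
phi = np.column_stack([np.ones(n), group])      # policy features: bias + group
p_log = np.full((n, n_actions), 0.5)            # known uniform logging policy
a_log = rng.integers(0, n_actions, size=n)      # logged actions
r = 0.5 + 0.3 * a_log - 0.25 * a_log * group + 0.1 * rng.normal(size=n)

# Crude outcome model q(g, a): per-(group, action) empirical reward means.
q = np.array([[r[(group == g) & (a_log == a)].mean()
               for a in range(n_actions)] for g in range(2)])

def policy(theta):
    """Softmax policy over actions, parameterized by theta (features x actions)."""
    logits = phi @ theta
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    return pi / pi.sum(axis=1, keepdims=True)

def dr_value(theta, mask):
    """Doubly robust value estimate of the policy on a subpopulation."""
    pi = policy(theta)
    q_x = q[group]                                   # regression (direct-method) part
    dm = (pi * q_x).sum(axis=1)
    w = pi[np.arange(n), a_log] / p_log[np.arange(n), a_log]
    v = dm + w * (r - q_x[np.arange(n), a_log])      # importance-weighted residual
    return v[mask].mean()

def objective(theta, lam):
    """Soft-constrained objective: overall value minus a disparity penalty."""
    v0, v1 = dr_value(theta, group == 0), dr_value(theta, group == 1)
    return dr_value(theta, np.ones(n, bool)) - lam * abs(v0 - v1)

def train(lam, steps=150, lr=1.0, eps=1e-4):
    """Gradient ascent on the penalized objective (finite differences for brevity)."""
    theta = np.zeros((2, n_actions))
    for _ in range(steps):
        g = np.zeros_like(theta)
        for i in range(theta.size):
            tp, tm = theta.copy(), theta.copy()
            tp.flat[i] += eps
            tm.flat[i] -= eps
            g.flat[i] = (objective(tp, lam) - objective(tm, lam)) / (2 * eps)
        theta += lr * g
    return theta

def disparity(theta):
    return abs(dr_value(theta, group == 0) - dr_value(theta, group == 1))
```

With `lam=0` the learner simply maximizes the DR value estimate and inherits the disparity baked into the rewards; with a positive `lam`, the penalized update trades a small amount of overall reward for a markedly smaller gap between the two groups' estimated values, mirroring the fairness-utility trade-off the paper reports.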