Multi-Armed Bandits with Minimum Aggregated Revenue Constraints

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the fairness-aware optimization problem in contextual multi-armed bandits, aiming to maximize cumulative reward while ensuring each arm receives a minimum aggregated reward across all contexts. To address limitations of existing methods, including the absence of closed-form optimal solutions and the failure to adapt to dynamic constraints, the authors propose an "optimistic-pessimistic" cooperative framework: optimism guides exploration, while pessimism enforces the fairness constraints. They derive problem-dependent upper bounds on both regret and constraint violation, and establish for the first time that the time-horizon dependence of these bounds is optimal in general settings. They also identify fundamental limitations of the free exploration principle leveraged in prior work. The framework offers a paradigm with provable guarantees for real-world applications requiring context-aware decision-making and fairness assurance, such as resource allocation and online advertising.

📝 Abstract
We examine a multi-armed bandit problem with contextual information, where the objective is to ensure that each arm receives a minimum aggregated reward across contexts while simultaneously maximizing the total cumulative reward. This framework captures a broad class of real-world applications where fair revenue allocation is critical and contextual variation is inherent. The cross-context aggregation of minimum reward constraints, while enabling better performance and easier feasibility, introduces significant technical challenges -- particularly the absence of closed-form optimal allocations typically available in standard MAB settings. We design and analyze algorithms that either optimistically prioritize performance or pessimistically enforce constraint satisfaction. For each algorithm, we derive problem-dependent upper bounds on both regret and constraint violations. Furthermore, we establish a lower bound demonstrating that the dependence on the time horizon in our results is optimal in general and revealing fundamental limitations of the free exploration principle leveraged in prior work.
Problem

Research questions and friction points this paper addresses.

Ensuring minimum aggregated reward per arm across contexts
Maximizing total cumulative reward under fairness constraints
Addressing technical challenges from cross-context constraint aggregation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-armed bandits with cross-context minimum reward constraints
Algorithms that optimistically prioritize performance or pessimistically enforce constraints
Derived upper bounds on regret and constraint violation
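The optimistic-pessimistic idea above can be sketched in a minimal, context-free toy simulation: pull the arm with the highest upper confidence bound (optimism), unless a pessimistic lower-confidence estimate indicates some arm is falling short of its minimum aggregated-reward target, in which case serve the most constraint-starved arm. The arm means, targets, and Hoeffding-style bonus below are illustrative assumptions, not the paper's actual algorithm or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 3, 5000                             # arms, horizon
true_means = np.array([0.9, 0.6, 0.3])     # hypothetical Bernoulli reward means
min_share = np.array([0.0, 0.1, 0.1])      # assumed per-arm minimum aggregated-reward rates

counts = np.zeros(K)   # pulls per arm
sums = np.zeros(K)     # total reward collected per arm

def bonus(t, n):
    # Hoeffding-style confidence radius; a common choice, not the paper's exact one
    return np.sqrt(2.0 * np.log(max(t, 1)) / np.maximum(n, 1))

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                                  # pull each arm once to initialize
    else:
        means = sums / counts
        ucb = means + bonus(t, counts)               # optimism: drives exploration
        lcb = np.clip(means - bonus(t, counts), 0.0, 1.0)  # pessimism: guards constraints
        # pessimistic estimate of each arm's aggregated reward so far vs. its target
        deficit = min_share * t - counts * lcb
        if np.any(deficit > 0):
            arm = int(np.argmax(deficit))            # serve the most constraint-starved arm
        else:
            arm = int(np.argmax(ucb))                # otherwise maximize reward optimistically
    r = rng.binomial(1, true_means[arm])
    counts[arm] += 1
    sums[arm] += r

print("pulls per arm:", counts)
print("average reward:", sums.sum() / T)
```

In this sketch, pessimism makes the learner over-serve the constrained arms slightly (it trusts only the lower confidence bound of their reward), trading some cumulative reward for constraint satisfaction; the remaining pulls concentrate on the highest-mean arm.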