🤖 AI Summary
In socially sensitive applications of multi-armed bandits, simultaneously ensuring privacy protection and decision fairness remains challenging. Method: This paper proposes the first online learning framework to unify differential privacy and Nash fairness. It operates without prior knowledge of the time horizon, supports both global and local differential privacy models, and achieves ε-differential privacy together with order-optimal Nash regret. The approach integrates a differentially private incentive mechanism with a Nash-aware upper confidence bound, jointly optimizing average regret and an inequity penalty term. Contribution/Results: Theoretical analysis and experiments on synthetic bandit instances demonstrate that, under strong privacy guarantees (ε ≤ 1), the proposed method substantially reduces Nash regret compared to existing baselines, thereby improving the Pareto efficiency of the fairness–utility trade-off.
📝 Abstract
Multi-armed bandit algorithms are fundamental tools for sequential decision-making under uncertainty, with widespread applications across domains such as clinical trials and personalized decision-making. As bandit algorithms are increasingly deployed in these socially sensitive settings, it becomes critical to protect user data privacy and ensure fair treatment across decision rounds. While prior work has independently addressed privacy and fairness in bandit settings, the question of whether both objectives can be achieved simultaneously has remained largely open. Existing privacy-preserving bandit algorithms typically optimize average regret, a utilitarian measure, whereas fairness-aware approaches focus on minimizing Nash regret, which penalizes inequitable reward distributions, but often disregard privacy concerns.
To bridge this gap, we introduce the Differentially Private Nash Confidence Bound (DP-NCB), a novel and unified algorithmic framework that simultaneously ensures ε-differential privacy and achieves order-optimal Nash regret, matching known lower bounds up to logarithmic factors. The framework is general enough to operate under both global and local differential privacy models, and it is anytime, requiring no prior knowledge of the time horizon. We support our theoretical guarantees with simulations on synthetic bandit instances, showing that DP-NCB incurs substantially lower Nash regret than state-of-the-art baselines. Our results offer a principled foundation for designing bandit algorithms that are both privacy-preserving and fair, making them suitable for high-stakes, socially impactful applications.
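To make the general recipe concrete, the sketch below shows one standard way a privacy-preserving confidence-bound bandit can be built: per-arm reward sums are perturbed with Laplace noise calibrated to the privacy budget ε before an exploration bonus is added. This is an illustrative, hypothetical sketch of that generic pattern, not the paper's actual DP-NCB algorithm (the class name `PrivateUCB`, the bonus form, and the noise placement are all assumptions; DP-NCB additionally optimizes the Nash objective and supports anytime operation).

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) via inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

class PrivateUCB:
    """Illustrative UCB index with Laplace-perturbed empirical means.

    Assumes rewards lie in [0, 1], so each reward sum has sensitivity 1
    and Laplace noise with scale 1/epsilon gives an epsilon-DP estimate.
    This is a generic sketch, NOT the paper's DP-NCB algorithm.
    """

    def __init__(self, n_arms, epsilon, seed=0):
        self.n_arms = n_arms
        self.epsilon = epsilon            # privacy budget
        self.counts = [0] * n_arms        # pulls per arm
        self.sums = [0.0] * n_arms        # raw reward sums (kept internal)
        self.rng = random.Random(seed)
        self.t = 0

    def select(self):
        self.t += 1
        # Initialization: play each arm once.
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a

        def index(a):
            n = self.counts[a]
            # Privatized mean: noise the sum, then normalize.
            noisy_mean = (self.sums[a]
                          + laplace_noise(1.0 / self.epsilon, self.rng)) / n
            # Standard UCB1-style exploration bonus.
            return noisy_mean + math.sqrt(2.0 * math.log(self.t) / n)

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
```

A short simulation on a two-armed Bernoulli instance (means 0.9 and 0.1) shows the noised index still concentrating on the better arm, since the per-mean noise shrinks as 1/(εn) while the confidence bonus shrinks only as 1/√n.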