Leveraging priors on distribution functions for multi-arm bandits

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-armed bandit (MAB) algorithms lack a principled mechanism for leveraging distributional priors in the absence of parametric assumptions. Method: We propose Dirichlet Process Posterior Sampling (DPPS), a nonparametric Bayesian MAB algorithm that systematically incorporates a Dirichlet process (DP) prior. DPPS models each arm's reward distribution directly, using the stick-breaking construction and posterior sampling to make probability-matching decisions; notably, Non-Parametric Thompson Sampling (NPTS) emerges as a special case. Contributions/Results: We establish a non-asymptotic Bayesian-regret optimality guarantee for a nonparametric MAB algorithm and enable interpretable prior specification. Empirical evaluation on synthetic and real-world benchmarks shows strong performance against state-of-the-art baselines, validating both the efficacy and robustness of prior-guided exploration.

📝 Abstract
We introduce Dirichlet Process Posterior Sampling (DPPS), a Bayesian non-parametric algorithm for multi-arm bandits based on Dirichlet Process (DP) priors. Like Thompson sampling, DPPS is a probability-matching algorithm, i.e., it plays an arm based on its posterior probability of being optimal. Instead of assuming a parametric class for the reward-generating distribution of each arm and then putting a prior on the parameters, in DPPS the reward-generating distribution is directly modeled using DP priors. DPPS provides a principled approach to incorporating prior belief about the bandit environment, and in the noninformative limit of the DP posteriors (i.e., the Bayesian Bootstrap), we recover Non Parametric Thompson Sampling (NPTS), a popular non-parametric bandit algorithm, as a special case of DPPS. We employ the stick-breaking representation of the DP priors, and show excellent empirical performance of DPPS in challenging synthetic and real-world bandit environments. Finally, using an information-theoretic analysis, we show non-asymptotic optimality of DPPS in the Bayesian regret setup.
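The mechanism the abstract describes can be sketched concretely. Given a DP(α, G0) prior and n observed rewards, the DP posterior is DP(α + n, base measure mixing G0 with the empirical distribution), and a draw from it can be approximated by truncated stick-breaking. The sketch below is an illustrative reconstruction under assumed details (truncation level, function names, uniform base measure in the usage example), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dp_mean(obs, alpha, base_sampler, trunc=200):
    """Draw one mean functional from the DP posterior via truncated
    stick-breaking. After n observations, the posterior is
    DP(alpha + n, (alpha*G0 + sum of point masses at obs) / (alpha + n))."""
    n = len(obs)
    # Stick-breaking weights: v_k ~ Beta(1, alpha + n),
    # w_k = v_k * prod_{j<k} (1 - v_j)
    v = rng.beta(1.0, alpha + n, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    w[-1] = 1.0 - w[:-1].sum()  # fold truncation remainder into the last atom
    # Atoms: from the prior base G0 w.p. alpha/(alpha+n),
    # otherwise a uniformly chosen observed reward
    if n == 0:
        atoms = base_sampler(trunc)
    else:
        from_prior = rng.random(trunc) < alpha / (alpha + n)
        atoms = np.where(from_prior, base_sampler(trunc),
                         rng.choice(obs, size=trunc))
    return float(np.dot(w, atoms))

def dpps_step(arm_obs, alpha, base_sampler):
    """One DPPS round: sample a DP-posterior mean per arm, play the argmax.
    This realizes probability matching, as in Thompson sampling."""
    samples = [sample_dp_mean(obs, alpha, base_sampler) for obs in arm_obs]
    return int(np.argmax(samples))

# Usage: two arms with rewards in [0, 1] and a Uniform(0, 1) base measure
uniform_base = lambda k: rng.uniform(0.0, 1.0, k)
arm = dpps_step([np.array([0.2, 0.3]), np.array([0.8])],
                alpha=1.0, base_sampler=uniform_base)
```

The prior belief enters through both the base measure G0 and the concentration α: larger α keeps more posterior mass on G0, smaller α trusts the observed rewards more.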
Problem

Research questions and friction points this paper is trying to address.

Develops DPPS for multi-arm bandits using Dirichlet Process priors.
Models reward distributions directly, avoiding parametric assumptions.
Demonstrates DPPS's optimality and performance in various environments.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dirichlet Process Posterior Sampling (DPPS) introduced
DPPS models reward distributions using DP priors
DPPS shows non-asymptotic optimality in Bayesian regret
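The NPTS special case noted above can also be sketched. In the noninformative limit (concentration α → 0, the Bayesian Bootstrap), a DP-posterior draw reduces to reweighting the observed rewards with flat Dirichlet(1, …, 1) weights. A hedged sketch for rewards in [0, 1], assuming the usual NPTS device of an optimistic pseudo-reward of 1 per arm:

```python
import numpy as np

rng = np.random.default_rng(1)

def npts_step(arm_obs):
    """One NPTS round: for each arm, draw Dirichlet(1, ..., 1) weights over
    its observed rewards (the Bayesian Bootstrap, i.e. the alpha -> 0 limit
    of the DP posterior) and play the arm with the largest reweighted mean."""
    index = []
    for obs in arm_obs:
        # Append an optimistic pseudo-reward of 1 so untried or
        # poorly-sampled arms can still be explored
        rewards = np.append(obs, 1.0)
        w = rng.dirichlet(np.ones(len(rewards)))
        index.append(float(np.dot(w, rewards)))
    return int(np.argmax(index))
```

Compared with the full DP-posterior sampler, this limit discards the base measure entirely, so no prior belief about the reward distributions is injected.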