Minimum Empirical Divergence for Sub-Gaussian Linear Bandits

📅 2024-10-31

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the sub-Gaussian linear bandit problem by proposing LinMED (Linear Minimum Empirical Divergence), an algorithm that extends the Minimum Empirical Divergence (MED) principle from multi-armed bandits to the linear setting. LinMED is a randomized algorithm that computes its arm sampling probabilities in closed form, which makes it compatible with off-policy evaluation, where unbiased estimates require accurate sampling probabilities. Theoretically, LinMED achieves a problem-dependent regret bound of $O\big(\frac{d^2}{\Delta}\log^2 n \cdot \log\log n\big)$, where $\Delta$ is the smallest sub-optimality gap; this is lower than the $\frac{d^2}{\Delta}\log^3 n$ bound of OFUL. Its regret also attains the near-optimal rate $d\sqrt{n}$ up to logarithmic factors. Empirically, LinMED is competitive with state-of-the-art methods. Key contributions include: (i) the linear extension of the MED principle, (ii) an analytically tractable randomized policy design, and (iii) a tighter problem-dependent regret analysis.
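To make the gap between the two problem-dependent rates concrete, the worked ratio below divides the OFUL rate by the LinMED rate as stated in the abstract. It compares only the stated rate expressions; constants hidden by the $O(\cdot)$ notation are ignored, and the choice of natural logarithms and of $n = 10^6$ is an assumption made purely for illustration.

$$
\frac{\frac{d^2}{\Delta}\log^3 n}{\frac{d^2}{\Delta}\log^2 n \cdot \log\log n} = \frac{\log n}{\log\log n},
\qquad
n = 10^6:\ \frac{\ln 10^6}{\ln\ln 10^6} \approx \frac{13.82}{2.63} \approx 5.3 .
$$

The ratio does not depend on $d$ or $\Delta$ and grows without bound as $n$ increases, so the improvement in rate becomes more pronounced over longer horizons.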

📝 Abstract

We propose a novel linear bandit algorithm called LinMED (Linear Minimum Empirical Divergence), which is a linear extension of the MED algorithm that was originally designed for multi-armed bandits. LinMED is a randomized algorithm that admits a closed-form computation of the arm sampling probabilities, unlike the popular randomized algorithm called linear Thompson sampling. Such a feature proves useful for off-policy evaluation where the unbiased evaluation requires accurately computing the sampling probability. We prove that LinMED enjoys a near-optimal regret bound of $d\sqrt{n}$ up to logarithmic factors where $d$ is the dimension and $n$ is the time horizon. We further show that LinMED enjoys a $\frac{d^2}{\Delta}\left(\log^2(n)\right)\log\left(\log(n)\right)$ problem-dependent regret where $\Delta$ is the smallest sub-optimality gap, which is lower than $\frac{d^2}{\Delta}\log^3(n)$ of the standard algorithm OFUL (Abbasi-Yadkori et al., 2011). Our empirical study shows that LinMED has a competitive performance with the state-of-the-art algorithms.
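The point about off-policy evaluation can be illustrated with a standard inverse propensity scoring (IPS) estimator, which reweights each logged reward by the ratio of target to logging probabilities. The sketch below is not LinMED and does not reproduce the paper's sampling formula: the fixed `logging_policy`, the `target_policy`, the reward means, and the helper name `ips_value` are all hypothetical, chosen only to show why the exact probability of each logged action is needed.

```python
import numpy as np


def ips_value(rewards, logged_probs, target_probs):
    """IPS estimate of a target policy's mean reward.

    rewards[t]      : reward observed at round t
    logged_probs[t] : probability with which the logging policy chose the logged arm
    target_probs[t] : probability the target policy assigns to that same arm
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_probs, dtype=float) / np.asarray(logged_probs, dtype=float)
    return float(np.mean(weights * rewards))


# Toy 3-armed example (all numbers are illustrative assumptions).
rng = np.random.default_rng(0)
n_arms, n_rounds = 3, 20000
true_means = np.array([0.2, 0.5, 0.8])
logging_policy = np.array([0.5, 0.3, 0.2])  # hypothetical policy with known probabilities
target_policy = np.array([0.1, 0.1, 0.8])   # hypothetical policy to evaluate

# Log data under the logging policy with Gaussian (hence sub-Gaussian) noise.
actions = rng.choice(n_arms, size=n_rounds, p=logging_policy)
rewards = true_means[actions] + 0.1 * rng.standard_normal(n_rounds)

estimate = ips_value(rewards, logging_policy[actions], target_policy[actions])
true_value = float(target_policy @ true_means)
print(f"IPS estimate: {estimate:.3f}, true value: {true_value:.3f}")
```

Because `ips_value` takes one logged probability per round, it also accommodates a logging policy that changes over time, as a bandit algorithm's does. If those probabilities can only be approximated, for example by Monte Carlo sampling, the importance weights inherit the approximation error, which is the difficulty a closed-form expression avoids.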

Problem

Research questions and friction points this paper is trying to address.

Develops LinMED algorithm for linear bandits

Provides near-optimal regret bound analysis

Enables accurate off-policy evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

LinMED algorithm extends MED for linear bandits

Closed-form computation of arm sampling probabilities

Near-optimal regret bound of $d\sqrt{n}$ up to logarithmic factors
