🤖 AI Summary
This paper studies a distributed multi-armed bandit problem over erasure channels: a learner sends arm-selection commands to an agent via a channel with erasure probability ε; upon erasure, the agent repeats the last successfully received arm pull, and the learner always observes the reward of the arm actually pulled. The central question is whether *erasure feedback*—agent-to-learner acknowledgment of successful command reception—improves the worst-case regret. The paper proves a regret lower bound of Ω(√(KT) + K/(1−ε)) that matches known no-feedback upper bounds up to logarithmic factors, showing that erasure feedback does not improve the asymptotic regret order; it can only affect constant factors. Building on this insight, the paper proposes a feedback-aware algorithm that preserves the optimal regret order while admitting a simpler design with better constants, and evaluates its performance numerically.
📝 Abstract
We study a distributed multi-armed bandit (MAB) problem over arm erasure channels, motivated by the increasing adoption of MAB algorithms over communication-constrained networks. In this setup, the learner communicates the chosen arm to play to an agent over a channel with erasure probability $\epsilon \in [0,1)$; if an erasure occurs, the agent continues pulling the last successfully received arm; the learner always observes the reward of the arm pulled. In past work, we considered the case where the agent cannot convey feedback to the learner, and thus the learner does not know whether the arm played is the requested one or the last successfully received one. In this paper, we instead consider the case where the agent can send feedback to the learner on whether the arm request was received, and thus the learner knows exactly which arm was played. Surprisingly, we prove that erasure feedback does not improve the worst-case regret order over the previously studied no-feedback setting. In particular, we prove a regret lower bound of $\Omega(\sqrt{KT} + K/(1-\epsilon))$, where $K$ is the number of arms and $T$ the time horizon, that matches no-feedback upper bounds up to logarithmic factors. We note, however, that the availability of feedback enables simpler algorithm designs that may achieve regret bounds with better constants (albeit not better order); we design one such algorithm and evaluate its performance numerically.
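To make the channel model concrete, the following is a minimal simulation sketch of the learner-agent interaction described above. The round-robin request policy, the Bernoulli arm means, and all function names are illustrative placeholders, not the algorithm proposed in the paper; with feedback, the learner can track `last_received` exactly, as modeled here.

```python
import random

def simulate_round(requested_arm, last_received_arm, erasure_prob, rng):
    """One interaction: the learner requests an arm; the command is erased
    with probability erasure_prob, in which case the agent replays the last
    successfully received arm. Returns (arm_played, received_flag); with
    erasure feedback, received_flag is known to the learner."""
    if rng.random() < erasure_prob:
        return last_received_arm, False  # erasure: agent repeats last arm
    return requested_arm, True           # command received successfully

def run(T=10_000, K=5, eps=0.3, seed=0):
    rng = random.Random(seed)
    means = [0.2, 0.4, 0.5, 0.6, 0.9]  # hypothetical Bernoulli arm means
    last_received = 0                   # agent starts on arm 0
    total_reward = 0.0
    for t in range(T):
        requested = t % K               # placeholder policy: round-robin
        played, received = simulate_round(requested, last_received, eps, rng)
        if received:
            last_received = played      # feedback reveals this to the learner
        # learner always observes the reward of the arm actually pulled
        total_reward += float(rng.random() < means[played])
    return total_reward / T

print(run())
```

Note that under this model the agent is never idle: every round some arm is pulled, which is why erasures cost regret proportional to the time spent stuck on a suboptimal arm, reflected in the $K/(1-\epsilon)$ term.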