🤖 AI Summary
This paper studies the combinatorial multi-armed bandit (CMAB) problem with stochastic submodular expected rewards and delayed composite anonymous feedback, addressing practical challenges including feedback aliasing, indistinguishable sources, and delayed arrival. We propose an online learning algorithm based on greedy sampling and decoupled estimation of delayed feedback. For the first time, we establish a unified regret bound of $\tilde{O}(T^{2/3} + T^{1/3}\nu)$ under three general delay models (bounded adversarial, stochastically independent, and stochastically conditionally independent delays), revealing a universal additive impact of delay on performance. Theoretically, this bound strictly improves upon existing full-feedback delay methods. Empirically, our algorithm significantly reduces cumulative regret on both synthetic benchmarks and real-world submodular tasks, including influence maximization.
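The greedy-sampling component builds on the classical greedy rule for monotone submodular maximization. A minimal offline sketch of that rule is below; the function name and signature are illustrative, and the paper's actual bandit algorithm must instead estimate marginal gains from delayed anonymous feedback rather than query $f$ directly:

```python
def greedy_max(ground_set, f, k):
    # Classical greedy for monotone submodular maximization under a
    # cardinality constraint k: repeatedly add the element with the
    # largest marginal gain f(S ∪ {x}) - f(S).
    S = set()
    for _ in range(k):
        best = max((x for x in ground_set if x not in S),
                   key=lambda x: f(S | {x}) - f(S))
        S.add(best)
    return S
```

For monotone submodular $f$, this greedy rule attains the well-known $(1 - 1/e)$ approximation; in the bandit setting the marginal gains are unknown and must be estimated from noisy, delayed rewards.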
📝 Abstract
This paper investigates the problem of combinatorial multi-armed bandits with stochastic submodular (in expectation) rewards and full-bandit delayed feedback, where the delayed feedback is assumed to be composite and anonymous. In other words, the delayed feedback is composed of components of rewards from past actions, with unknown division among the sub-components. Three models of delayed feedback are studied: bounded adversarial, stochastic independent, and stochastic conditionally independent, and regret bounds are derived for each. Ignoring problem-dependent parameters, we show that the regret bound for all three delay models is $\tilde{O}(T^{2/3} + T^{1/3}\nu)$ for time horizon $T$, where $\nu$ is a delay parameter defined differently in the three cases, thus demonstrating an additive delay term in the regret under all three delay models. The considered algorithm is demonstrated to outperform other full-bandit approaches with delayed composite anonymous feedback.
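To make the feedback model concrete, the following is a minimal simulation sketch of composite anonymous delayed feedback in the bounded-delay case. All names here are illustrative (not the paper's notation): the reward of the action played at round $t$ is split into non-negative components that arrive over rounds $t, \dots, t + \text{delay\_bound}$, and the learner observes only the per-round aggregate, never which past action each component came from:

```python
import random

def simulate_composite_feedback(T, reward_fn, delay_bound, actions, rng=None):
    # Sketch: each round's reward is divided into unknown components
    # spread over the next delay_bound + 1 rounds; the learner sees only
    # the anonymous aggregate arriving at each round.
    rng = rng or random.Random(0)
    pending = [0.0] * (T + delay_bound)   # aggregate arriving at each round
    observed = []
    for t in range(T):
        a = rng.choice(actions)           # placeholder uniform policy
        r = reward_fn(a)
        weights = [rng.random() for _ in range(delay_bound + 1)]
        total = sum(weights)
        for d, w in enumerate(weights):   # unknown division into components
            pending[t + d] += r * w / total
        observed.append(pending[t])       # anonymous aggregate at round t
    return observed, pending
```

The aggregation is what makes the problem hard: standard per-action reward estimators break down, since an observation at round $t$ mixes contributions from up to $\text{delay\_bound} + 1$ past actions.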