🤖 AI Summary
This paper studies online learning with delayed feedback under capacity constraints, i.e., the system can retain at most $C$ rounds of historical feedback at any time. We introduce the first unified capacity-constrained model, subsuming delayed multi-armed bandits, label-efficient learning, and online scheduling. Our approach integrates Pareto-distributed proxy delays, dynamic batched scheduling, and clairvoyant/preemptible feedback mechanisms, coupled with information-theoretic lower bounds and adaptive tracking strategies. We precisely characterize the minimal required capacity as a function of the delay structure. For $K$ actions, $T$ rounds, and total delay $D$, we derive tight regret bounds: $\widetilde{\Theta}\big(\sqrt{TK + DK/C + D\log K}\big)$ for bandit feedback and $\widetilde{\Theta}\big(\sqrt{(D+T)\log K}\big)$ for full-information feedback, demonstrating that regret degrades gracefully as the capacity $C$ decreases.
📝 Abstract
We study online learning with oblivious losses and delays under a novel ``capacity constraint'' that limits how many past rounds can be tracked simultaneously for delayed feedback. Under ``clairvoyance'' (i.e., delay durations are revealed upfront each round) and/or ``preemptibility'' (i.e., the ability to stop tracking feedback from previously chosen rounds), we establish matching upper and lower bounds (up to logarithmic terms) on achievable regret, characterizing the ``optimal capacity'' needed to match the minimax rates of classical delayed online learning, which implicitly assumes unlimited capacity. Our algorithms achieve minimax-optimal regret across all capacity levels, with performance gracefully degrading under suboptimal capacity. For $K$ actions and total delay $D$ over $T$ rounds, under clairvoyance and assuming capacity $C = \Omega(\log(T))$, we achieve regret $\widetilde{\Theta}(\sqrt{TK + DK/C + D\log(K)})$ for bandits and $\widetilde{\Theta}(\sqrt{(D+T)\log(K)})$ for full-information feedback. When replacing clairvoyance with preemptibility, we require a known maximum delay bound $d_{\max}$, adding $\smash{\widetilde{O}(d_{\max})}$ to the regret. For fixed delays $d$ (i.e., $D=Td$), the minimax regret is $\Theta\bigl(\sqrt{TK(1+d/C)+Td\log(K)}\bigr)$ and the optimal capacity is $\Theta\bigl(\min\{K/\log(K),d\}\bigr)$ in the bandit setting, while in the full-information setting, the minimax regret is $\Theta\bigl(\sqrt{T(d+1)\log(K)}\bigr)$ and the optimal capacity is $\Theta(1)$. For round-dependent and fixed delays, our upper bounds are achieved using novel scheduling policies based on Pareto-distributed proxy delays and batching techniques. Crucially, our work unifies delayed bandits, label-efficient learning, and online scheduling frameworks, demonstrating that robust online learning under delayed feedback is possible with surprisingly modest tracking capacity.
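As a consistency check (using only quantities defined in the abstract), the fixed-delay bandit rate follows from the round-dependent one by substituting $D = Td$ into the general bound:
$$\widetilde{\Theta}\Bigl(\sqrt{TK + (Td)K/C + (Td)\log(K)}\Bigr) \;=\; \widetilde{\Theta}\Bigl(\sqrt{TK(1+d/C) + Td\log(K)}\Bigr).$$
The capacity-dependent term $TdK/C$ is dominated by the other two precisely when $C = \Omega\bigl(dK/(K + d\log(K))\bigr) = \Omega\bigl(\min\{K/\log(K),\, d\}\bigr)$, which recovers the stated optimal capacity in the bandit setting.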