🤖 AI Summary
This paper studies the online shortest path problem on directed acyclic graphs (DAGs) against an adaptive adversary: in each round, the learner selects a source-to-sink path and observes only the total loss (bandit feedback), aiming to minimize regret relative to the best fixed path over $T$ rounds. To tackle this strongly adversarial setting, we propose the first computationally efficient algorithm, whose core innovations are a novel edge-loss estimator and a high-probability analysis framework based on centroid decomposition, overcoming long-standing analytical bottlenecks for bandit feedback under adaptivity. Theoretically, our algorithm achieves a high-probability regret bound of $\tilde{O}(\sqrt{|E|\,T\log|X|})$, which is nearly minimax optimal. Moreover, our approach unifies and improves algorithmic guarantees for several fundamental combinatorial decision problems, including $m$-sets, extensive-form games, the Colonel Blotto game, and hypercube decision spaces.
📝 Abstract
In this paper, we study the online shortest path problem in directed acyclic graphs (DAGs) under bandit feedback against an adaptive adversary. Given a DAG $G = (V, E)$ with a source node $v_{\mathsf{s}}$ and a sink node $v_{\mathsf{t}}$, let $X \subseteq \{0,1\}^{|E|}$ denote the set of all paths from $v_{\mathsf{s}}$ to $v_{\mathsf{t}}$. At each round $t$, we select a path $\mathbf{x}_t \in X$ and receive bandit feedback on our loss $\langle \mathbf{x}_t, \mathbf{y}_t \rangle \in [-1,1]$, where $\mathbf{y}_t$ is an adversarially chosen loss vector. Our goal is to minimize regret with respect to the best path in hindsight over $T$ rounds. We propose the first computationally efficient algorithm to achieve a near-minimax optimal regret bound of $\tilde{O}(\sqrt{|E|\,T\log|X|})$ with high probability against any adaptive adversary, where $\tilde{O}(\cdot)$ hides logarithmic factors in the number of edges $|E|$. Our algorithm leverages a novel loss estimator and a centroid-based decomposition in a nontrivial manner to attain this regret bound. As an application, we show that our algorithm for DAGs provides state-of-the-art efficient algorithms for $m$-sets, extensive-form games, the Colonel Blotto game, shortest walks in directed graphs, hypercubes, and multi-task multi-armed bandits, achieving improved high-probability regret guarantees in all these settings.
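To make the interaction model concrete, here is a minimal sketch of the protocol described above: in each round the learner commits to a path $\mathbf{x}_t$, the adversary fixes a loss vector $\mathbf{y}_t$ over edges, and the learner observes only the scalar $\langle \mathbf{x}_t, \mathbf{y}_t \rangle$. The sketch runs a naive Exp3 over explicitly enumerated paths purely for illustration; this enumeration can be exponential in $|E|$, and the paper's actual contribution is an efficient algorithm and estimator that avoid it. All function and variable names here are our own, not from the paper.

```python
import math
import random

def enumerate_paths(adj, s, t):
    """All s->t paths as tuples of edge ids; adj maps node -> [(next_node, edge_id)]."""
    paths = []
    def dfs(u, edges):
        if u == t:
            paths.append(tuple(edges))
            return
        for v, e in adj.get(u, []):
            dfs(v, edges + [e])
    dfs(s, [])
    return paths

def exp3_shortest_path(adj, s, t, losses, eta=0.1):
    """Naive Exp3 over all s->t paths against a sequence of edge-loss vectors.

    losses[r] is the adversary's loss vector y_r over edges. Returns the
    learner's cumulative loss and the cumulative loss of the best fixed
    path in hindsight (the regret comparator from the abstract).
    """
    paths = enumerate_paths(adj, s, t)
    weights = [1.0] * len(paths)
    learner_loss = 0.0
    path_totals = [0.0] * len(paths)
    for y in losses:
        total = sum(weights)
        probs = [w / total for w in weights]
        i = random.choices(range(len(paths)), probs)[0]
        loss = sum(y[e] for e in paths[i])  # bandit feedback: one scalar only
        learner_loss += loss
        # importance-weighted estimate, nonzero only for the chosen path
        weights[i] *= math.exp(-eta * loss / probs[i])
        for j, p in enumerate(paths):
            path_totals[j] += sum(y[e] for e in p)
    return learner_loss, min(path_totals)
```

On a toy two-path DAG with a consistently cheaper path, the learner's cumulative loss approaches that of the best fixed path; the point of the sketch is only the feedback model, not the efficiency or the high-probability guarantee of the paper's algorithm.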