🤖 AI Summary
This paper studies the sparse adversarial stochastic shortest path (SSP) problem under full-information feedback. Existing negative-entropy regularization achieves regret scaling with √log(SA) when transitions are known; while this is worst-case optimal, it cannot exploit cost sparsity (only M ≪ SA state-action pairs incur nonzero costs), and the negative-entropy regularizer provably still incurs √log(S) regret on sparse instances. To address this, we propose an online mirror descent algorithm with ℓᵣ-norm regularization for r ∈ (1,2). Our method adaptively leverages cost sparsity: under known transitions, it achieves regret scaling with √log(M), which we prove tight via a matching lower bound; under unknown transitions, we characterize the fundamental limit of sparsity gains, showing that any algorithm's minimax regret must scale polynomially with SA. These results establish that, with known transitions, the intrinsic complexity of sparse SSP is governed by the effective dimension M, not the full state-action space size SA.
📝 Abstract
We study the adversarial Stochastic Shortest Path (SSP) problem with sparse costs under full-information feedback. In the known transition setting, existing bounds based on Online Mirror Descent (OMD) with negative-entropy regularization scale with $\sqrt{\log SA}$, where $SA$ is the size of the state-action space. While we show that this is optimal in the worst case, this bound fails to capture the benefits of sparsity when only a small number $M \ll SA$ of state-action pairs incur cost. In fact, we also show that the negative-entropy is inherently non-adaptive to sparsity: it provably incurs regret scaling with $\sqrt{\log S}$ on sparse problems. Instead, we propose a family of $\ell_r$-norm regularizers ($r \in (1,2)$) that adapts to the sparsity and achieves regret scaling with $\sqrt{\log M}$ instead of $\sqrt{\log SA}$. We show this is optimal via a matching lower bound, highlighting that $M$ captures the effective dimension of the problem instead of $SA$. Finally, in the unknown transition setting the benefits of sparsity are limited: we prove that even on sparse problems, the minimax regret for any learner scales polynomially with $SA$.
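To make the central algorithmic idea concrete, the following is a minimal sketch of one OMD update with an $\ell_r$-norm regularizer on the probability simplex. This is an illustration only: the paper's algorithm runs OMD over occupancy measures of the SSP, and the exact regularizer, step size, and decision set are assumptions here, not taken from the paper. The sketch uses the separable mirror map $\Phi(x) = \sum_i x_i^r$ with $r \in (1,2)$ and performs the Bregman projection back onto the simplex by bisection over the dual variable.

```python
import numpy as np

def omd_lr_step(x, g, eta, r=1.5, tol=1e-12):
    """One OMD step on the simplex with the l_r regularizer
    Phi(x) = sum_i x_i**r, r in (1, 2).

    Illustrative sketch only: the paper's algorithm applies OMD over
    occupancy measures of the SSP; this simplex version just shows the
    mechanics of the l_r mirror map. `eta` is the step size, `g` the
    observed cost (loss gradient)."""
    # Dual (mirror) update: grad Phi(y_tilde) = grad Phi(x) - eta * g,
    # where grad Phi(x)_i = r * x_i**(r-1).
    theta = r * np.power(x, r - 1.0) - eta * g

    # Bregman projection onto the simplex. For a separable Phi it reduces
    # to finding mu such that y_i(mu) = ((theta_i + mu)/r)_+^(1/(r-1))
    # sums to one; the sum is increasing in mu, so bisection works.
    def y(mu):
        return np.maximum((theta + mu) / r, 0.0) ** (1.0 / (r - 1.0))

    lo = -theta.max()          # here y(lo).sum() == 0
    hi = r - theta.min()       # here every y_i >= 1, so y(hi).sum() >= 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if y(mid).sum() < 1.0:
            lo = mid
        else:
            hi = mid
    return y(0.5 * (lo + hi))
```

As a sanity check of the dynamics: starting from the uniform distribution and charging cost only to the first coordinate shifts mass away from it while the untouched coordinates remain equal, mirroring how the regularizer reallocates play toward low-cost state-action pairs.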