COMPASS-Hedge: Learning Safely Without Knowing the World

📅 2026-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of simultaneously achieving worst-case robustness in adversarial environments, instance optimality in stochastic settings, and safety relative to a baseline policy in full-information online learning. The paper proposes a fully parameter-free algorithm that, without prior knowledge of the environment type or suboptimality gaps, integrates adaptive pseudo-regret scaling, phased aggressive exploration, and a baseline-aware mixing mechanism within a unified framework. This approach delivers threefold guarantees: minimax-optimal regret in adversarial regimes, instance-optimal regret in stochastic regimes, and Õ(1) safety regret—relative to any fixed baseline—with only a logarithmic factor degradation. To the best of our knowledge, this is the first algorithm in the full-information setting to achieve such “triple optimality” without requiring any environmental assumptions or tuning parameters.
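For reference, the three notions of regret mentioned above are usually formalized as follows in the standard full-information experts setup; the notation (play distribution p_t, loss vector ℓ_t, fixed baseline π_0) is the common convention and is assumed here rather than taken from the paper.

```latex
% Standard full-information regret notions (notation assumed, not taken from the paper):
% p_t is the learner's distribution over K experts, \ell_t \in [0,1]^K the loss vector,
% and \pi_0 a fixed baseline distribution.
\begin{align*}
  \text{(adversarial)} \quad & R_T = \sum_{t=1}^{T} \langle p_t, \ell_t \rangle - \min_{i \in [K]} \sum_{t=1}^{T} \ell_{t,i}, \\
  \text{(stochastic)} \quad & \bar{R}_T = \mathbb{E}\!\left[\sum_{t=1}^{T} \langle p_t, \ell_t \rangle\right] - T\,\mu_{i^\star},
  \qquad \mu_{i^\star} = \min_{i} \mathbb{E}[\ell_{t,i}], \\
  \text{(baseline safety)} \quad & R_T^{\pi_0} = \sum_{t=1}^{T} \langle p_t, \ell_t \rangle - \sum_{t=1}^{T} \langle \pi_0, \ell_t \rangle .
\end{align*}
```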

📝 Abstract
Online learning algorithms often face a fundamental trilemma: balancing regret guarantees between adversarial and stochastic settings while providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. Our algorithm is the first full-information method to simultaneously achieve: i) minimax-optimal regret in adversarial environments; ii) instance-optimal, gap-dependent regret in stochastic environments; and iii) $\tilde{\mathcal{O}}(1)$ regret relative to a designated baseline policy, up to logarithmic factors. Crucially, COMPASS-Hedge is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first "best-of-three-worlds" guarantee in the full-information setting, establishing that baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.
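To make the moving parts concrete, the sketch below combines a plain Hedge (exponential-weights) update with a baseline-mixing step of the kind the abstract alludes to. This is a minimal illustration, not the authors' COMPASS-Hedge: the learning-rate and mixing schedules are generic placeholders, and the function name and structure are assumptions made here for clarity; the paper's parameter-free scalings and phase-based mechanism are what it actually contributes.

```python
# Illustrative sketch only -- not the paper's pseudocode.
# It shows (a) an exponential-weights (Hedge) update over K experts and
# (b) a comparator-aware mixing step that keeps some probability mass on a
# designated baseline expert. The schedules eta_t and beta_t are placeholders.
import numpy as np

def hedge_with_baseline(losses, baseline_index=0):
    """losses: (T, K) array of per-round expert losses in [0, 1]."""
    T, K = losses.shape
    log_w = np.zeros(K)            # log-weights of the experts
    total_loss = 0.0
    for t in range(1, T + 1):
        eta_t = np.sqrt(np.log(K) / t)   # placeholder step-size schedule
        beta_t = 1.0 / (t + 1)           # placeholder baseline-mixing weight
        p = np.exp(log_w - log_w.max())  # normalized Hedge distribution
        p /= p.sum()
        # Comparator-aware mixing: shift a small amount of mass to the baseline.
        play = (1.0 - beta_t) * p
        play[baseline_index] += beta_t
        loss_t = losses[t - 1]
        total_loss += play @ loss_t
        log_w -= eta_t * loss_t          # multiplicative-weights update
    return total_loss
```

Calling `hedge_with_baseline(np.random.rand(1000, 5))` exercises the loop end to end; the only point of the sketch is where the baseline mixing enters the played distribution, not the particular schedules used.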
Problem

Research questions and friction points this paper is trying to address.

online learning
regret minimization
baseline safety
adversarial environment
stochastic environment
Innovation

Methods, ideas, or system contributions that make the work stand out.

parameter-free online learning
best-of-three-worlds
baseline safety
adaptive regret scaling
full-information setting
Ting Hu
Associate Professor, School of Computing, Queen's University, Canada
Explainable AI · Evolutionary Computing · Machine Learning · Bioinformatics
Luanda Cai
Department of Finance, University of Wisconsin–Madison
Manolis Vlatakis
Department of Computer Sciences, University of Wisconsin–Madison