🤖 AI Summary
This work investigates the minimization of satisficing regret in non-stationary multi-armed bandits, revealing that even a single environment change, i.e., a piecewise-stationary setting with \(L \geq 2\) stationary segments, forces the optimal satisficing regret to grow with the time horizon \(T\): it cannot remain bounded by a constant. To establish this result, we introduce a novel information-theoretic analysis framework based on Fano's inequality, featuring a "post-interaction reference" construction that rigorously generalizes both the classical Fano method and existing interactive Fano techniques to accommodate non-stationary feedback structures. Our theoretical analysis shows that the optimal satisficing regret scales as \(\Theta(L \log T)\) for \(L \geq 2\), in sharp contrast to the \(\Theta(1)\) bound achievable when \(L = 1\), and further identifies a special regime under which constant regret can be recovered.
📝 Abstract
Motivated by the principle of satisficing in decision-making, we study satisficing regret guarantees for nonstationary $K$-armed bandits. We show that in the general realizable, piecewise-stationary setting with $L$ stationary segments, the optimal regret is $\Theta(L\log T)$ as long as $L\geq 2$. This stands in sharp contrast to the case of $L=1$ (i.e., the stationary setting), where a $T$-independent $\Theta(1)$ satisficing regret is achievable under realizability. In other words, the optimal regret has to scale with $T$ even if just a little nonstationarity is present. A key ingredient in our analysis is a novel Fano-based framework tailored to nonstationary bandits via a \emph{post-interaction reference} construction. This framework strictly extends the classical Fano method for passive estimation as well as recent interactive Fano techniques for stationary bandits. As a complement, we also discuss a special regime in which constant satisficing regret is again possible.
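To make the setting concrete, the sketch below simulates a piecewise-stationary Bernoulli bandit with $L = 2$ segments and tallies satisficing regret, i.e., the shortfall $(S - \mu_{a_t})^+$ accrued only when the pulled arm's mean falls below the threshold $S$. The instance, the threshold, and the epsilon-greedy learner are all illustrative assumptions, not the paper's construction or algorithm; realizability here means each segment contains at least one arm with mean at least $S$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance (not from the paper): K arms, L = 2 segments,
# satisficing threshold S. Each segment has a satisficing arm (realizability).
K, T, S = 3, 1000, 0.5
means = [np.array([0.7, 0.4, 0.3]),   # segment 1: arm 0 satisfices
         np.array([0.3, 0.4, 0.7])]   # segment 2: arm 2 satisfices
change_point = T // 2                 # single change => L = 2 segments

counts = np.zeros(K)
sums = np.zeros(K)
regret = 0.0

for t in range(T):
    mu = means[0] if t < change_point else means[1]
    # Illustrative epsilon-greedy learner, oblivious to the change point.
    if counts.min() == 0 or rng.random() < 0.1:
        a = int(rng.integers(K))
    else:
        a = int(np.argmax(sums / counts))
    r = float(rng.random() < mu[a])   # Bernoulli reward
    counts[a] += 1
    sums[a] += r
    # Satisficing regret penalizes only pulls of arms below the threshold S.
    regret += max(S - mu[a], 0.0)

print(f"satisficing regret over T={T}: {regret:.1f}")
```

An oblivious learner like this one keeps paying after the change point; the paper's point is that even change-aware strategies cannot avoid regret growing as $\Theta(L\log T)$ once $L \geq 2$.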