A New Benchmark for Online Learning with Budget-Balancing Constraints

📅 2025-03-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies adversarial multi-armed bandits under budget constraints (adversarial Bandits with Knapsacks), addressing the fundamental impossibility of no-regret learning against the classical fixed-distribution benchmark. It introduces a dynamic expenditure-pattern benchmark based on the Earth Mover's Distance (EMD), using EMD to characterize feasible expenditure trajectories and to formally model practical pacing strategies such as windowed pacing. Theoretically, sublinear regret is attainable against any strategy whose spending pattern lies within EMD $o(T^2)$ of a sub-pacing pattern, and the paper gives further evidence that this EMD condition is necessary. Against the pacing-over-windows benchmark, the algorithm achieves a regret bound of $\tilde{O}(T/\sqrt{w} + \sqrt{wT})$, matching an information-theoretic lower bound. The work establishes a new benchmarking paradigm and analytical framework for budget-constrained online learning.

📝 Abstract
The adversarial Bandits with Knapsacks (BwK) problem is a multi-armed bandit problem with budget constraints and adversarial rewards and costs. In each round, a learner selects an action to take and observes the reward and cost of the selected action. The goal is to maximize the sum of rewards while satisfying the budget constraint. The classical benchmark to compare against is the best fixed distribution over actions that satisfies the budget constraint in expectation. Unlike its stochastic counterpart, where rewards and costs are drawn from some fixed distribution (Badanidiyuru et al., 2018), the adversarial BwK problem does not admit a no-regret algorithm for every problem instance due to the "spend-or-save" dilemma (Immorlica et al., 2022). A key problem left open by existing works is whether there exists a weaker but still meaningful benchmark to compare against such that no-regret learning is still possible. In this work, we present a new benchmark to compare against, motivated both by real-world applications such as autobidding and by its underlying mathematical structure. The benchmark is based on the Earth Mover's Distance (EMD), and we show that sublinear regret is attainable against any strategy whose spending pattern is within EMD $o(T^2)$ of any sub-pacing spending pattern. As a special case, we obtain results against the "pacing over windows" benchmark, where we partition time into disjoint windows of size $w$ and allow the benchmark strategies to choose a different distribution over actions for each window while satisfying a pacing budget constraint. Against this benchmark, our algorithm obtains a regret bound of $\tilde{O}(T/\sqrt{w}+\sqrt{wT})$. We also show a matching lower bound, proving the optimality of our algorithm in this important special case. In addition, we provide further evidence of the necessity of the EMD condition for obtaining a sublinear regret.
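To build intuition for the EMD-based benchmark, the sketch below computes a one-dimensional Earth Mover's Distance between two per-round spending patterns: for distributions over rounds $1..T$, EMD reduces to the L1 distance between cumulative spend curves. This is an illustrative reduction only; the paper's exact trajectory metric and normalization are not specified in the abstract and may differ.

```python
def emd_1d(spend_a, spend_b):
    """Earth Mover's Distance between two per-round spending patterns.

    For one-dimensional distributions over rounds 1..T, EMD equals the
    L1 distance between the cumulative spend curves. (Illustrative
    sketch; the paper's exact definition may differ.)
    """
    assert len(spend_a) == len(spend_b)
    dist, cum_a, cum_b = 0.0, 0.0, 0.0
    for a, b in zip(spend_a, spend_b):
        cum_a += a
        cum_b += b
        dist += abs(cum_a - cum_b)
    return dist

# Uniform pacing vs. front-loaded spending of the same budget B over T rounds:
T, B = 8, 8.0
uniform = [B / T] * T                             # spend 1.0 each round
front_loaded = [B / 2, B / 2] + [0.0] * (T - 2)   # spend everything early
print(emd_1d(uniform, front_loaded))              # 24.0
```

A front-loaded spender with the same total budget sits at large EMD from the uniform pacer, which is exactly the kind of deviation the benchmark's $o(T^2)$ condition bounds.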
Problem

Research questions and friction points this paper is trying to address.

Maximize rewards under adversarial budget constraints.
Develop a no-regret benchmark using Earth Mover's Distance.
Achieve sublinear regret against pacing-over-windows strategies.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Earth Mover's Distance benchmark
Achieves sublinear regret with EMD condition
Optimal algorithm for pacing over windows
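The pacing-over-windows benchmark above can be illustrated with a small feasibility check: time is split into disjoint windows of size $w$, and each window's spend is capped at its proportional budget share. Note the per-window cap of $B \cdot w / T$ is an assumption for illustration; the abstract does not spell out the exact pacing constraint.

```python
def satisfies_windowed_pacing(spend, w, budget, tol=1e-9):
    """Check a 'pacing over windows' style constraint.

    Time is split into disjoint windows of size w; here we assume each
    window's spend is capped at budget * w / T (a proportional share --
    an illustrative assumption, not the paper's exact constraint).
    """
    T = len(spend)
    per_window_cap = budget * w / T
    for start in range(0, T, w):
        if sum(spend[start:start + w]) > per_window_cap + tol:
            return False
    return True

# A uniform pacer satisfies the constraint; a front-loaded spender does not.
T, B, w = 8, 8.0, 4
print(satisfies_windowed_pacing([B / T] * T, w, B))                 # True
print(satisfies_windowed_pacing([B / 2, B / 2] + [0.0] * 6, w, B))  # False
```

Benchmark strategies may switch their action distribution between windows, so larger $w$ gives the benchmark more freedom; the $\tilde{O}(T/\sqrt{w}+\sqrt{wT})$ bound reflects this trade-off.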
Mark Braverman
Princeton University

Jingyi Liu
Department of Computer Science, Princeton University

Jieming Mao
Google Research NYC
Theoretical Computer Science

Jon Schneider
Google Research
Machine Learning, Game Theory, Theoretical Computer Science

Eric Xue
Department of Computer Science, Princeton University