Beyond the Lower Bound: Bridging Regret Minimization and Best Arm Identification in Lexicographic Bandits

📅 2025-11-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the multi-objective multi-armed bandit problem under hierarchical lexicographic preferences, jointly optimizing for regret minimization and best arm identification. We propose two elimination-based algorithms: the first performs layer-wise pruning according to preference priority, unifying both objectives within a single lexicographic preference framework for the first time; the second leverages cross-objective reward-signal sharing and dependency modeling to provably beat the known single-objective sample-complexity lower bound. Both algorithms attain optimal single-objective rates for cumulative regret and sample complexity, and empirically outperform existing baselines by significant margins. Our core contributions are: (i) establishing the first theoretical framework for jointly optimizing regret and identification under lexicographic preferences; and (ii) proving that strategic cross-objective information reuse yields substantial statistical-efficiency gains.

📝 Abstract
In multi-objective decision-making with hierarchical preferences, lexicographic bandits provide a natural framework for optimizing multiple objectives in a prioritized order. In this setting, a learner repeatedly selects arms and observes reward vectors, aiming to maximize the reward for the highest-priority objective, then the next, and so on. While previous studies have primarily focused on regret minimization, this work bridges the gap between *regret minimization* and *best arm identification* under lexicographic preferences. We propose two elimination-based algorithms to address this joint objective. The first algorithm eliminates suboptimal arms sequentially, layer by layer, in accordance with the objective priorities, and achieves sample-complexity and regret bounds comparable to those of the best single-objective algorithms. The second algorithm simultaneously leverages reward information from all objectives in each round, effectively exploiting cross-objective dependencies. Remarkably, it outperforms the known lower bound for the single-objective bandit problem, highlighting the benefit of cross-objective information sharing in the multi-objective setting. Empirical results further validate the superior performance of both algorithms over baselines.
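The first algorithm's layer-by-layer elimination can be sketched as follows. This is a minimal illustration of the general idea, not the paper's exact procedure: the function name, the equal per-layer budget split, and the Hoeffding-style confidence radius are assumptions made for the sketch.

```python
import math

def lex_successive_elimination(pull, n_arms, n_objectives, horizon, delta=0.05):
    """Layer-wise elimination over objectives in priority order (illustrative sketch).

    `pull(arm)` is assumed to return a reward vector, highest-priority
    objective first, with entries in [0, 1].
    """
    active = list(range(n_arms))
    # Crude equal split of the budget across layers (an assumption of this sketch).
    rounds_per_layer = horizon // (n_objectives * n_arms)
    for obj in range(n_objectives):  # objectives in decreasing priority
        sums = {a: 0.0 for a in active}
        counts = {a: 0 for a in active}
        for _ in range(rounds_per_layer):
            if len(active) == 1:
                break
            # Pull every surviving arm once; record the current layer's reward.
            for a in list(active):
                sums[a] += pull(a)[obj]
                counts[a] += 1
            # Hoeffding-style confidence radius for rewards in [0, 1].
            radius = {a: math.sqrt(math.log(2 * n_arms * horizon / delta)
                                   / (2 * counts[a])) for a in active}
            best_lcb = max(sums[a] / counts[a] - radius[a] for a in active)
            # Keep only arms whose upper bound still reaches the best lower bound.
            active = [a for a in active
                      if sums[a] / counts[a] + radius[a] >= best_lcb]
    return active  # arms surviving all layers, i.e. lexicographically near-optimal
```

Arms tied on a higher-priority objective survive that layer and are separated by lower-priority layers, which is what makes the pruning lexicographic. The paper's second algorithm differs by sharing reward information across all objectives in every round rather than processing one layer at a time.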
Problem

Research questions and friction points this paper is trying to address.

Bridging regret minimization and best arm identification in lexicographic bandits
Proposing elimination algorithms for hierarchical multi-objective optimization
Exploiting cross-objective dependencies to surpass single-objective performance limits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential elimination based on objective priorities
Simultaneous reward utilization across all objectives
Surpassing the single-objective sample-complexity lower bound through cross-objective information sharing