🤖 AI Summary
This paper studies the multi-objective ordinal multi-armed bandit problem under hierarchical lexicographic preferences, jointly optimizing for regret minimization and best-arm identification. We propose two elimination-based algorithms: the first performs layer-wise pruning according to preference priority, unifying both objectives within a single ordinal preference framework for the first time; the second leverages cross-objective reward signal sharing and dependency modeling to provably break the single-objective sample complexity lower bound, achieving superior theoretical performance. Both algorithms attain optimal single-objective rates in cumulative regret and sample complexity, and empirically outperform existing baselines by significant margins. Our core contributions are: (i) establishing the first theoretical framework for joint optimization of regret and identification under lexicographic preferences; and (ii) proving that strategic cross-objective information reuse yields substantial statistical efficiency gains.
📝 Abstract
In multi-objective decision-making with hierarchical preferences, lexicographic bandits provide a natural framework for optimizing multiple objectives in a prioritized order. In this setting, a learner repeatedly selects arms and observes reward vectors, aiming to maximize the reward for the highest-priority objective, then the next, and so on. While previous studies have primarily focused on regret minimization, this work bridges the gap between *regret minimization* and *best arm identification* under lexicographic preferences. We propose two elimination-based algorithms to address this joint objective. The first algorithm eliminates suboptimal arms sequentially, layer by layer, in accordance with the objective priorities, and achieves sample complexity and regret bounds comparable to those of the best single-objective algorithms. The second algorithm simultaneously leverages reward information from all objectives in each round, effectively exploiting cross-objective dependencies. Remarkably, it surpasses the known lower bound for the single-objective bandit problem, highlighting the benefit of cross-objective information sharing in the multi-objective setting. Empirical results further validate the superior performance of both algorithms over existing baselines.
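To make the layer-by-layer elimination idea concrete, here is a minimal sketch of lexicographic successive elimination. It is an illustration under simplifying assumptions (rewards in [0, 1], a Hoeffding-style confidence radius, an even sampling budget per objective), not the paper's exact procedure: for each objective in priority order, all surviving arms are played in rounds, arms whose confidence interval falls below the best arm's are eliminated, and the survivors advance to the next objective. The function name `lexicographic_elimination` and its interface are hypothetical.

```python
import math


def lexicographic_elimination(arms, horizon, n_objectives):
    """Layer-wise successive elimination under lexicographic preferences.

    `arms` is a list of callables; arms[i]() returns a reward vector
    (a tuple of length n_objectives) with entries in [0, 1]. Objectives
    are processed in priority order; only arms that survive objective k
    compete on objective k + 1.
    """
    active = list(range(len(arms)))
    budget_per_layer = horizon // n_objectives
    for obj in range(n_objectives):
        # Fresh statistics for this objective layer.
        counts = {a: 0 for a in active}
        sums = {a: 0.0 for a in active}
        rounds = max(1, budget_per_layer // max(1, len(active)))
        for _ in range(rounds):
            for a in active:
                sums[a] += arms[a]()[obj]
                counts[a] += 1

            # Hoeffding-style confidence radius for [0, 1] rewards.
            def radius(a):
                return math.sqrt(2.0 * math.log(horizon) / counts[a])

            means = {a: sums[a] / counts[a] for a in active}
            best_lcb = max(means[a] - radius(a) for a in active)
            # Keep only arms still plausibly optimal on this objective.
            active = [a for a in active if means[a] + radius(a) >= best_lcb]
            if len(active) == 1:
                break
    return active
```

With deterministic rewards such as `arms = [lambda: (1.0, 0.2), lambda: (1.0, 0.9), lambda: (0.1, 1.0)]`, the first layer eliminates arm 2 (weak on the top-priority objective, despite being best on the second), and the second layer separates arms 0 and 1, leaving the lexicographically optimal arm 1. The second algorithm in the paper improves on this template by reusing reward signals across objectives instead of budgeting each layer independently.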