🤖 AI Summary
This study investigates the intrinsic trade-off between regret minimization and statistical inference accuracy in combinatorial multi-armed bandits. Addressing both full-bandit and semi-bandit feedback structures, the work proposes an adaptive combinatorial experimental design grounded in a Pareto optimality framework, characterizing for the first time the Pareto frontier between cumulative regret and statistical power and establishing necessary and sufficient conditions for Pareto-efficient learning. Leveraging information-theoretic tools and upper confidence bound (UCB) techniques, two algorithms, MixCombKL (based on KL divergence) and MixCombUCB (based on UCB indices), are developed to simultaneously achieve low cumulative regret and high parameter-estimation accuracy within a finite time horizon. Theoretical analysis confirms their Pareto optimality and shows that richer feedback structures substantially tighten the attainable performance bounds.
📝 Abstract
In this paper, we provide the first investigation into adaptive combinatorial experimental design, focusing on the trade-off between regret minimization and statistical power in combinatorial multi-armed bandits (CMAB). While minimizing regret requires repeated exploitation of high-reward arms, accurate inference on reward gaps requires sufficient exploration of suboptimal actions. We formalize this trade-off through the concept of Pareto optimality and establish equivalent conditions for Pareto-efficient learning in CMAB. We consider two cases under different information structures, namely full-bandit feedback and semi-bandit feedback, and propose two algorithms, MixCombKL and MixCombUCB, for these settings respectively. We provide theoretical guarantees showing that both algorithms are Pareto optimal, achieving finite-time guarantees on both regret and the estimation error of arm gaps. Our results further reveal that richer feedback significantly tightens the attainable Pareto frontier, with the primary gains arising from improved estimation accuracy under our proposed methods. Taken together, these findings establish a principled framework for adaptive combinatorial experimentation in multi-objective decision-making.
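To make the regret-versus-inference tension concrete, the sketch below mixes a standard UCB index with a small rate of forced uniform exploration in a single-play stochastic bandit. This is an illustrative toy, not the paper's MixCombUCB (whose combinatorial action sets, feedback structures, and mixing schedule are not specified here): pure UCB starves suboptimal arms of samples, so their gap estimates stay noisy, while the forced-exploration mixture pays a little extra regret to keep every arm's estimate accurate. The function name `mix_ucb`, the `mix_rate` parameter, and the Bernoulli arms are all assumptions for the example.

```python
import math
import random

def mix_ucb(means, horizon, mix_rate=0.1, seed=0):
    """Toy UCB bandit with a uniform-exploration mixture (illustrative only).

    With probability `mix_rate` a uniformly random arm is forced, so every
    arm keeps accumulating samples for estimation; otherwise the standard
    UCB index is played to keep cumulative regret low.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k       # pulls per arm
    sums = [0.0] * k       # cumulative reward per arm
    regret = 0.0
    best = max(means)
    for t in range(1, horizon + 1):
        if min(counts) == 0:            # initialize: play each arm once
            arm = counts.index(0)
        elif rng.random() < mix_rate:   # forced exploration for estimation
            arm = rng.randrange(k)
        else:                           # UCB index: mean + confidence bonus
            arm = max(range(k),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli arm
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]     # expected (pseudo-)regret increment
    estimates = [sums[i] / counts[i] for i in range(k)]
    return regret, estimates, counts
```

Raising `mix_rate` moves along the regret/accuracy trade-off: more forced pulls shrink the estimation error of every arm gap at the cost of linear-in-`mix_rate` regret, which is the tension the Pareto frontier in the paper formalizes.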