🤖 AI Summary
This paper studies best-arm identification (BAI) in multi-armed bandits under the Entropic Value-at-Risk (EVaR) criterion, addressing risk-averse decision-making in high-stakes domains such as finance. It introduces EVaR, previously unexplored in BAI, into the fixed-confidence framework for general reward distributions bounded in [0,1]. The authors propose a δ-correct Track-and-Stop algorithm that combines nonparametric modeling with a coupled pair of optimization problems, one convex and one related, simpler non-convex one, to design risk-sensitive sampling policies. Leveraging large-deviations theory, they establish asymptotic optimality: the expected sample complexity matches the information-theoretic lower bound. Experiments demonstrate that the algorithm achieves the prescribed confidence level while substantially improving risk control over baseline methods. This work provides the first asymptotically optimal BAI procedure under the EVaR criterion, with formal guarantees on both statistical correctness and risk-aware performance.
📝 Abstract
We study the fixed-confidence best arm identification (BAI) problem within the multi-armed bandit (MAB) framework under the Entropic Value-at-Risk (EVaR) criterion. Our analysis considers a nonparametric setting, allowing for general reward distributions bounded in [0,1]. This formulation addresses the critical need for risk-averse decision-making in high-stakes environments, such as finance, moving beyond simple expected-value optimization. We propose a $\delta$-correct, Track-and-Stop based algorithm and derive a corresponding lower bound on the expected sample complexity, which we prove our algorithm asymptotically matches. The implementation of our algorithm and the characterization of the lower bound both require solving a complex convex optimization problem and a related, simpler non-convex one.
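To make the risk criterion concrete: EVaR admits a standard one-dimensional dual representation, $\mathrm{EVaR}_{1-\alpha}(X) = \inf_{z>0} z^{-1}\log\!\left(\mathbb{E}[e^{zX}]/\alpha\right)$ (Ahmadi-Javid, 2012). The sketch below is a minimal empirical estimator of this quantity via grid search over the dual variable $z$, safe for samples bounded in [0,1]; it is an illustration of the criterion itself, not the paper's algorithm, whose convex/non-convex optimization routines are not reproduced here.

```python
import numpy as np

def empirical_evar(samples, alpha, z_grid=None):
    """Empirical Entropic Value-at-Risk at confidence level 1 - alpha.

    Uses the dual form EVaR_{1-alpha}(X) = inf_{z>0} (log E[exp(zX)] - log alpha) / z.
    The log-MGF is evaluated in a shifted (log-sum-exp style) form, which is
    numerically stable because the samples are bounded.
    """
    x = np.asarray(samples, dtype=float)
    if z_grid is None:
        # Grid over the dual variable; adequate for [0,1]-bounded samples.
        z_grid = np.logspace(-2, 3, 400)
    xmax = x.max()
    best = np.inf
    for z in z_grid:
        # Stable log E[exp(z x)] = z*xmax + log(mean(exp(z*(x - xmax))))
        log_mgf = z * xmax + np.log(np.mean(np.exp(z * (x - xmax))))
        best = min(best, (log_mgf - np.log(alpha)) / z)
    return best
```

Since EVaR upper-bounds the expectation (and CVaR at the same level), a risk-averse learner penalizing large losses would apply it to losses, or to $-X$ when rewards are to be maximized; the precise arm-ranking objective used in the paper follows its own formulation.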