🤖 AI Summary
This work addresses the problem of identifying an $\varepsilon$-optimal action in depth-2 max-min trees under a fixed sampling budget, where the goal is to recommend a subtree whose minimal leaf value is within $\varepsilon$ of the optimal maximin value. The paper proposes the first $\varepsilon$-agnostic algorithm that requires no prior knowledge of $\varepsilon$ and achieves instance-dependent error guarantees for any valid $\varepsilon$. By integrating Monte Carlo tree search with halving-style and Successive Rejects strategies from multi-armed bandits, the method introduces a novel complexity measure, $H_2(\varepsilon)$, which captures both inter-subtree and intra-subtree hardness. Theoretical analysis shows that the probability of misidentification decays exponentially as $\exp(-\widetilde{\Theta}(T/H_2(\varepsilon)))$, providing the first provable fixed-budget performance guarantee for max-min action identification and revealing fundamental differences in difficulty structure compared to classical multi-armed bandit problems.
📝 Abstract
We study the fixed-budget max-min action identification problem in depth-2 max-min trees, an important special case of Monte Carlo Tree Search. A learner sequentially allocates $T$ samples to leaves and then recommends a subtree whose minimum leaf value is largest. Motivated by approximate planning, we focus on $\varepsilon$-good subtree identification, where any subtree whose min value is within $\varepsilon$ of the optimal maximin value is acceptable.
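To make the objective concrete, the following minimal sketch computes the maximin value and the set of $\varepsilon$-good subtrees for a toy depth-2 tree. It assumes the true leaf means are known, which the learner in the paper does not have; the function names and the example tree are hypothetical and serve only to illustrate the definition.

```python
from typing import List

# A depth-2 max-min tree is modelled here as a list of subtrees,
# each subtree being a list of (true, normally unknown) leaf means.
Tree = List[List[float]]


def maximin_value(tree: Tree) -> float:
    """Largest, over subtrees, of the smallest leaf mean in that subtree."""
    return max(min(leaves) for leaves in tree)


def eps_good_subtrees(tree: Tree, eps: float) -> List[int]:
    """Indices of subtrees whose min leaf mean is within eps of the maximin value."""
    best = maximin_value(tree)
    return [i for i, leaves in enumerate(tree) if min(leaves) >= best - eps]


# Hypothetical instance: subtree minima are 0.6, 0.55 and 0.2.
tree = [[0.9, 0.6], [0.7, 0.55, 0.8], [0.95, 0.2]]
print(maximin_value(tree))
print(eps_good_subtrees(tree, 0.0))
print(eps_good_subtrees(tree, 0.1))
```

With $\varepsilon = 0$ only the optimal subtree is acceptable; increasing $\varepsilon$ to $0.1$ also admits the second subtree, whose minimum leaf is $0.05$ below the maximin value.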
Our main contribution is an $\varepsilon$-agnostic algorithm: it does not require $\varepsilon$ as input, but achieves instance-dependent error bounds for every meaningful $\varepsilon$. We show that the misidentification probability decays as $\exp(-\widetilde{\Theta}(T/H_2(\varepsilon)))$, where $H_2(\varepsilon)$ captures both cross-subtree and within-subtree gaps. When each subtree has a single leaf, the problem reduces to standard fixed-budget best-arm identification, and our analysis recovers, up to accelerating factors, known $\varepsilon$-good guarantees for halving-style methods while giving a new $\varepsilon$-good guarantee for Successive Rejects.
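For orientation in the single-leaf special case, here is a minimal sketch of the classical Successive Rejects procedure for fixed-budget best-arm identification (Audibert and Bubeck). It is not the paper's max-min algorithm; the `pull` callback, the function name, and the example means are assumptions made for illustration.

```python
import math
from typing import Callable, List


def successive_rejects(num_arms: int, budget: int,
                       pull: Callable[[int], float]) -> int:
    """Classical Successive Rejects: returns the index of the recommended arm.

    `pull(arm)` is a hypothetical callback returning one noisy reward sample.
    """
    if num_arms == 1:
        return 0
    if budget <= num_arms:
        raise ValueError("budget must exceed the number of arms")

    # log-bar(K) = 1/2 + sum_{i=2}^{K} 1/i, the usual normalising constant.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, num_arms + 1))

    active: List[int] = list(range(num_arms))
    sums = [0.0] * num_arms
    counts = [0] * num_arms
    prev_n = 0

    for phase in range(1, num_arms):
        # Cumulative number of pulls per surviving arm by the end of this phase.
        n_k = math.ceil((budget - num_arms) / (log_bar * (num_arms + 1 - phase)))
        for arm in active:
            for _ in range(n_k - prev_n):
                sums[arm] += pull(arm)
                counts[arm] += 1
        prev_n = n_k
        # Reject the surviving arm with the lowest empirical mean.
        worst = min(active, key=lambda a: sums[a] / counts[a])
        active.remove(worst)

    return active[0]


# Hypothetical noiseless instance, used only to check the control flow.
means = [0.1, 0.5, 0.9]
print(successive_rejects(len(means), 30, lambda arm: means[arm]))
```

The phase lengths are chosen so that the total number of pulls never exceeds the budget $T$, and no target accuracy $\varepsilon$ is supplied as input.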
On the lower-bound side, we provide complementary positive and negative results showing that max-min identification has a different hardness structure from standard $K$-armed bandits. To our knowledge, this is the first provable fixed-budget algorithmic guarantee for max-min action identification.