Extreme Value Monte Carlo Tree Search

📅 2024-05-28
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Classical planning with Monte Carlo Tree Search (MCTS) coupled with Gaussian-assumption Multi-Armed Bandits (MABs) suffers from two theoretical limitations: (i) the Gaussian reward model under-specifies cost-to-go estimates, which are non-negative and in some cases bounded; and (ii) Full-Bellman backups, which backpropagate the minimum/maximum of samples, lack formal justification. Method: This work introduces extreme-value statistics into MCTS, proposing a MAB framework tailored to non-negative (and possibly bounded) rewards. It designs two UCB variants, UCB1-Uniform and UCB1-Power, with provable regret bounds, and grounds min/max Full-Bellman backpropagation in extreme-value theory. Contribution/Results: Experiments on classical planning benchmarks demonstrate significant performance gains over Gaussian-MAB-based MCTS, establishing an online search paradigm for heuristic planning that is both theoretically rigorous and empirically robust, particularly with unbounded or poorly informed heuristics.
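The summary contrasts the proposed bandits with standard UCB1. The paper's exact UCB1-Uniform/Power index formulas are not reproduced on this page; as a point of reference, the following is a minimal sketch of plain UCB1 arm selection for $[0,1]$-bounded rewards, the baseline the paper argues is mis-specified for cost-to-go estimates. Function and variable names here are illustrative, not from the paper.

```python
import math

def ucb1_score(mean, visits, total_visits, c=math.sqrt(2)):
    """Standard UCB1 index for [0,1]-bounded rewards:
    empirical mean plus an exploration bonus that shrinks
    as the arm accumulates visits."""
    return mean + c * math.sqrt(math.log(total_visits) / visits)

def select_arm(means, visits):
    """Pick the arm with the highest UCB1 index;
    any unvisited arm is tried first."""
    total = sum(visits)
    best, best_score = None, -math.inf
    for i, (m, n) in enumerate(zip(means, visits)):
        if n == 0:
            return i  # force at least one sample per arm
        score = ucb1_score(m, n, total)
        if score > best_score:
            best, best_score = i, score
    return best
```

The paper's variants replace this $[0,1]$-Gaussian-style confidence term with bounds derived from extreme-value statistics over non-negative rewards.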

📝 Abstract
Despite being successful in board games and reinforcement learning (RL), UCT, a Monte-Carlo Tree Search (MCTS) combined with the UCB1 Multi-Armed Bandit (MAB), has had limited success in domain-independent planning until recently. Previous work showed that UCB1, designed for $[0,1]$-bounded rewards, is not appropriate for estimating distances-to-go, which are potentially unbounded in $\mathbb{R}$, such as heuristic functions used in classical planning; it then proposed combining MCTS with MABs designed for Gaussian reward distributions and successfully improved the performance. In this paper, we further sharpen our understanding of ideal bandits for planning tasks. Existing work has two issues: First, while Gaussian MABs no longer over-specify the distances as $h \in [0,1]$, they under-specify them as $h \in [-\infty, \infty]$, while they are non-negative and can be further bounded in some cases. Second, there is no theoretical justification for the Full-Bellman backup (Schulte & Keller, 2014), which backpropagates the minimum/maximum of samples. We identified \emph{extreme value} statistics as a theoretical framework that resolves both issues at once, propose two bandits, UCB1-Uniform/Power, and apply them to MCTS for classical planning. We formally prove their regret bounds and empirically demonstrate their performance in classical planning.
Problem

Research questions and friction points this paper is trying to address.

Improves bandit algorithms for cost-to-go estimates in classical planning
Addresses underspecified support and theoretical gaps in existing methods
Proposes extreme-value-theory-based bandits (UCB1-Uniform/Power) with regret analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extreme Value Theory for cost-to-go estimation
UCB1-Uniform/Power bandit algorithms with theoretical regret guarantees
Improved MCTS performance in classical planning domains
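The abstract notes that the Full-Bellman backup (Schulte & Keller, 2014) backpropagates the minimum/maximum of samples rather than an average. As a rough sketch of the min-based case for cost-to-go, assuming a toy tree representation (the dictionary layout and function name are hypothetical, not from the paper):

```python
def full_bellman_min(node):
    """Min-based Full-Bellman backup for cost-to-go: a node's value is
    the minimum over children of (action cost + child cost-to-go);
    leaves return their heuristic estimate h(s)."""
    if not node.get("children"):
        return node["h"]
    return min(cost + full_bellman_min(child)
               for cost, child in node["children"])

# e.g. a root with two actions: cost 1 to a leaf with h=3,
# cost 2 to a leaf with h=0 -> backed-up value min(1+3, 2+0) = 2
```

The paper's contribution is a formal justification for this style of min/max backup via extreme-value theory, which this sketch does not attempt to capture.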