🤖 AI Summary
This paper addresses the low sample efficiency and poor decision quality of Monte Carlo Tree Search (MCTS) in complex environments. To this end, it introduces Doubly Robust (DR) off-policy estimation into the MCTS framework for the first time, yielding a hybrid evaluator that guarantees both unbiasedness and variance reduction under specified conditions. The proposed DR-MCTS delivers substantial sample-efficiency gains across model scales, including under partial observability. Experiments demonstrate: (i) an 88% win rate in Tic-Tac-Toe, 78 percentage points higher than standard MCTS; (ii) a 20.7% success rate on compound tasks in VirtualHome, more than doubling the baseline; and (iii) DR-MCTS with a smaller language model outperforming standard MCTS with a larger one. The core contribution is a theoretically grounded DR-MCTS architecture that significantly improves both policy-evaluation accuracy and data efficiency.
📄 Abstract
We present Doubly Robust Monte Carlo Tree Search (DR-MCTS), a novel algorithm that integrates Doubly Robust (DR) off-policy estimation into Monte Carlo Tree Search (MCTS) to enhance sample efficiency and decision quality in complex environments. Our approach introduces a hybrid estimator that combines MCTS rollouts with DR estimation, offering theoretical guarantees of unbiasedness and variance reduction under specified conditions. Empirical evaluations in Tic-Tac-Toe and the partially observable VirtualHome environment demonstrate DR-MCTS's superior performance over standard MCTS. In Tic-Tac-Toe, DR-MCTS achieves an 88% win rate compared to a 10% win rate for standard MCTS. In compound VirtualHome tasks, DR-MCTS attains a 20.7% success rate versus 10.3% for standard MCTS. Our scaling analysis reveals that DR-MCTS exhibits superior sample efficiency: paired with a smaller language model, it outperforms standard MCTS paired with larger ones. These results underscore DR-MCTS's potential for efficient decision-making in complex, real-world scenarios where sample efficiency is paramount.
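The hybrid estimator described above builds on standard doubly robust off-policy evaluation: a model-based (direct) estimate is corrected by an importance-weighted residual, so the estimate stays unbiased if either the model or the importance weights are accurate. The following is a minimal bandit-style sketch of that DR idea only, not the paper's exact MCTS formulation; the function and argument names are hypothetical.

```python
def dr_estimate(samples, q_hat, target_probs, behavior_probs):
    """Doubly robust off-policy value estimate (one-step bandit sketch).

    samples:        list of (action, reward) pairs drawn under the behavior policy.
    q_hat:          dict action -> estimated reward (the direct-method model).
    target_probs:   dict action -> probability under the policy being evaluated.
    behavior_probs: dict action -> probability under the data-collecting policy.
    """
    # Direct-method term: expected reward under the target policy per the model.
    dm = sum(target_probs[a] * q_hat[a] for a in target_probs)
    # Importance-weighted correction of the model's residual error; this term
    # has zero expectation whenever q_hat is exact, which is the source of the
    # "doubly robust" unbiasedness and variance-reduction properties.
    corrections = [
        (target_probs[a] / behavior_probs[a]) * (r - q_hat[a])
        for a, r in samples
    ]
    return dm + sum(corrections) / len(corrections)
```

When `q_hat` matches the true rewards, every correction term vanishes and the estimate reduces to the model's prediction; when `q_hat` is wrong, the importance-weighted residuals repair it on average. DR-MCTS blends an estimator of this flavor with ordinary rollout returns inside the tree search.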