🤖 AI Summary
This work addresses the challenge of making the planning and reasoning of large language models (LLMs) reliable in complex strategic domains—specifically chess, Chess960, Connect Four, and Hex. We propose a dual-path search framework: “external search,” an LLM-guided, engine-free Monte Carlo Tree Search (MCTS), and “internal search,” an in-context linearized search-tree generation mechanism. To our knowledge, this is the first approach enabling LLMs to autonomously model state transitions and value functions entirely within the language space, significantly mitigating hallucination; it also introduces the first end-to-end trainable internal search and the first MCTS fully driven by LLMs without external game engines. Leveraging domain-specific pretraining and zero-/few-shot policy distillation, our model achieves Grandmaster-level performance in chess, operates near a human-scale search budget, and consistently outperforms general-purpose LLM baselines across all target games—matching or approaching the performance of specialized game engines.
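To make the "external search" idea concrete, the sketch below shows an MCTS loop in which every query normally answered by a game engine—move priors, state transitions, and leaf values—is instead answered by a model object. `ToyModel` is a hypothetical stand-in for the LLM (in the paper's setting these queries would be text prompts over board states); the PUCT selection rule and toy string-based game are illustrative assumptions, not the paper's implementation.

```python
import math

class ToyModel:
    """Hypothetical stand-in for an LLM queried as policy, transition and value model."""

    def legal_moves(self, state):
        # LLM-as-world-model: proposes candidate moves (toy: two letters, depth-capped).
        return ["a", "b"] if len(state) < 4 else []

    def priors(self, state):
        # LLM-as-policy: a distribution over candidate moves (toy: uniform).
        moves = self.legal_moves(state)
        return {m: 1.0 / len(moves) for m in moves}

    def next_state(self, state, move):
        # LLM-as-transition-model: predicts the successor state in text form.
        return state + move

    def value(self, state):
        # LLM-as-value-model: scalar evaluation of a state (toy: fraction of 'a's).
        return state.count("a") / max(len(state), 1)

class Node:
    def __init__(self, state, prior):
        self.state, self.prior = state, prior
        self.children, self.visits, self.total = {}, 0, 0.0

def select(node, c_puct=1.5):
    # AlphaZero-style PUCT: exploit mean value, explore by prior and visit counts.
    return max(node.children.values(),
               key=lambda ch: (ch.total / (ch.visits + 1e-9)
                               + c_puct * ch.prior
                               * math.sqrt(node.visits) / (1 + ch.visits)))

def mcts(model, root_state, simulations=50):
    root = Node(root_state, 1.0)
    for _ in range(simulations):
        node, path = root, [root]
        # 1) Selection: descend to an unexpanded node.
        while node.children:
            node = select(node)
            path.append(node)
        # 2) Expansion and evaluation, both via the model (no engine calls).
        for move, p in model.priors(node.state).items():
            node.children[move] = Node(model.next_state(node.state, move), p)
        v = model.value(node.state)
        # 3) Backup the value along the visited path.
        for n in path:
            n.visits += 1
            n.total += v
    # Final choice: the most-visited root move.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

Because the model supplies transitions as well as evaluations, the quality of the search hinges on the model capturing those functions with minimal hallucination—the property the pretraining stage is meant to secure.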
📝 Abstract
Advancing the planning and reasoning capabilities of Large Language Models (LLMs) is one of the key prerequisites to unlocking their potential for performing reliably in complex and impactful domains. In this paper, we aim to demonstrate this across board games (Chess, Fischer Random / Chess960, Connect Four, and Hex), and we show that search-based planning can yield significant improvements in LLM game-playing strength. We introduce, compare and contrast two major approaches: in external search, the model guides Monte Carlo Tree Search (MCTS) rollouts and evaluations without calls to an external game engine, and in internal search, the model is trained to generate in-context a linearized search tree and a resulting final choice. Both build on a language model pre-trained on relevant domain knowledge, reliably capturing the transition and value functions in the respective environments with minimal hallucinations. We evaluate our LLM search implementations against game-specific state-of-the-art engines, showcasing substantial improvements in strength over the base model, and reaching Grandmaster-level performance in chess while operating closer to the human search budget. Our proposed approach, combining search with domain knowledge, is not specific to board games, hinting at more general future applications.
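The "internal search" variant can be pictured as follows: rather than running an external loop, the model is trained to emit, within a single generation, a linearized trace of the search tree followed by its final choice. The sketch below builds such a linearized trace for a toy domain; the bracketed node/value notation and the one-ply final choice are illustrative assumptions, not the paper's actual training format.

```python
def linearize(state, expand, value, depth):
    """Depth-first linearization of a small search tree into one string,
    mimicking the kind of token sequence a model could be trained to emit."""
    if depth == 0 or not expand(state):
        return f"[{state} v={value(state):.2f}]"
    parts = [f"[{state}"]
    for move, child in expand(state):
        parts.append(f" {move}->" + linearize(child, expand, value, depth - 1))
    parts.append("]")
    return "".join(parts)

def search_and_choose(state, expand, value, depth=1):
    """Emit trace plus final choice (here chosen by a simple one-ply
    lookahead; a trained model would generate both as a single text)."""
    trace = linearize(state, expand, value, depth)
    choice = max(expand(state), key=lambda mc: value(mc[1]))[0]
    return trace + f" best: {choice}", choice

# Toy domain: states are strings, moves append a letter, value counts 'a's.
expand = lambda s: [(m, s + m) for m in "ab"] if len(s) < 3 else []
value = lambda s: s.count("a") / max(len(s), 1)

text, move = search_and_choose("x", expand, value, depth=1)
```

Because the whole tree is flattened into the context window, this formulation is end-to-end trainable with standard next-token objectives, at the cost of the search budget being bounded by context length.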