🤖 AI Summary
Chain-of-thought (CoT) reasoning in large language models (LLMs) is often verbose and unsystematic, proceeding by trial and error rather than structured deduction. Method: This paper proposes Tree-of-Thoughts Reinforcement Learning (ToTRL), an on-policy RL framework with a rule-based reward that builds a parallel tree-of-thoughts (ToT) strategy on top of an LLM's existing sequential long-CoT capability. During training, the LLM acts as a player in puzzle games whose interdependent choices and multiple constraints require constructing and exploring a thought tree: multiple reasoning branches are generated and evaluated in parallel, and unproductive paths are pruned. Contribution/Results: The resulting ToTQwen3-8B model achieves significant accuracy gains on complex reasoning benchmarks while reducing tokens per answer, demonstrating that tree-structured reasoning improves both inference quality and computational efficiency.
📝 Abstract
Large language models (LLMs) demonstrate significant reasoning capabilities, particularly through long chain-of-thought (CoT) processes, which can be elicited by reinforcement learning (RL). However, prolonged CoT reasoning has limitations, primarily verbose outputs caused by excessive introspection; the reasoning process in these LLMs often follows a trial-and-error methodology rather than systematic, logical deduction. In contrast, tree-of-thoughts (ToT) offers a conceptually more advanced approach by modeling reasoning as exploration within a tree structure. This structure supports the parallel generation and evaluation of multiple reasoning branches, allowing unproductive paths to be actively identified, assessed, and pruned, which can potentially lead to improved performance and reduced token costs. Building upon the long CoT capability of LLMs, we introduce tree-of-thoughts RL (ToTRL), a novel on-policy RL framework with a rule-based reward. ToTRL is designed to guide LLMs in developing the parallel ToT strategy on top of the sequential CoT strategy. Furthermore, we employ LLMs as players in a puzzle game during ToTRL training. Solving puzzle games inherently requires exploring interdependent choices and managing multiple constraints, which demands the construction and exploration of a thought tree, providing challenging tasks for cultivating the ToT reasoning capability. Our empirical evaluations demonstrate that our ToTQwen3-8B model, trained with ToTRL, achieves significant improvements in performance and reasoning efficiency on complex reasoning tasks.
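To make the tree-search idea concrete, here is a minimal toy sketch of ToT-style reasoning with parallel branch generation, evaluation, and pruning, applied to a stand-in puzzle (extend a digit sequence so it sums to a target). The `expand`, `score`, `BEAM`, and `DEPTH` names and the puzzle itself are illustrative assumptions, not the paper's method; in ToTRL the proposal and evaluation steps are carried out by the trained LLM itself rather than by hand-written functions.

```python
# Toy tree-of-thoughts search: generate branches in parallel, evaluate them,
# and prune all but the most promising. Purely illustrative; all names here
# (expand, score, BEAM, DEPTH) are assumptions, not from the paper.
from typing import List, Tuple

TARGET = 10  # puzzle goal: the digits must sum to TARGET
DEPTH = 3    # number of reasoning steps (tree depth)
BEAM = 4     # branches kept after pruning at each level

def expand(state: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Generate child thoughts: append each digit 0-9 as a parallel branch."""
    return [state + (d,) for d in range(10)]

def score(state: Tuple[int, ...]) -> int:
    """Evaluate a partial thought: closer to the target sum is better."""
    return -abs(TARGET - sum(state))

def tot_search() -> Tuple[int, ...]:
    frontier: List[Tuple[int, ...]] = [()]  # root: empty thought
    for _ in range(DEPTH):
        # Expand every surviving branch, then prune to the BEAM best.
        children = [c for s in frontier for c in expand(s)]
        frontier = sorted(children, key=score, reverse=True)[:BEAM]
    return frontier[0]

best = tot_search()
print(best, sum(best))
```

In contrast to sequential CoT, which would commit to one digit at a time, the search above keeps several candidate branches alive and discards unproductive ones at each level, which is the behavior ToTRL's rule-based reward is meant to cultivate inside the model's own generation.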