ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving

📅 2025-05-19
🤖 AI Summary
Chain-of-thought (CoT) reasoning in large language models (LLMs) suffers from verbosity, inefficiency, and a lack of systematic structure. Method: This paper proposes Tree-of-Thoughts Reinforcement Learning (ToTRL), a framework that formalizes Tree-of-Thoughts (ToT) reasoning as a trainable reinforcement-learning policy. Within puzzle-solving environments, ToTRL employs policy-gradient methods to jointly optimize thought-tree generation and dynamic pruning. It combines rule-based rewards with explicit structural constraints on thought topology, enabling a shift from sequential CoT to parallel, prunable ToT. Contribution/Results: The resulting ToTQwen3-8B model achieves significant accuracy gains on complex reasoning benchmarks while reducing tokens per answer, demonstrating that tree-structured reasoning improves both inference quality and computational efficiency.

📝 Abstract
Large language models (LLMs) demonstrate significant reasoning capabilities, particularly through long chain-of-thought (CoT) processes, which can be elicited by reinforcement learning (RL). However, prolonged CoT reasoning presents limitations, primarily verbose outputs due to excessive introspection. The reasoning process in these LLMs often appears to follow a trial-and-error methodology rather than a systematic, logical deduction. In contrast, tree-of-thoughts (ToT) offers a conceptually more advanced approach by modeling reasoning as an exploration within a tree structure. This reasoning structure facilitates the parallel generation and evaluation of multiple reasoning branches, allowing for the active identification, assessment, and pruning of unproductive paths. This process can potentially lead to improved performance and reduced token costs. Building upon the long CoT capability of LLMs, we introduce tree-of-thoughts RL (ToTRL), a novel on-policy RL framework with a rule-based reward. ToTRL is designed to guide LLMs in developing the parallel ToT strategy based on the sequential CoT strategy. Furthermore, we employ LLMs as players in a puzzle game during the ToTRL training process. Solving puzzle games inherently necessitates exploring interdependent choices and managing multiple constraints, which requires the construction and exploration of a thought tree, providing challenging tasks for cultivating the ToT reasoning capability. Our empirical evaluations demonstrate that our ToTQwen3-8B model, trained with our ToTRL, achieves significant improvement in performance and reasoning efficiency on complex reasoning tasks.
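The abstract describes ToT reasoning as parallel generation and evaluation of reasoning branches, with unproductive paths actively pruned. The paper does not give an implementation here, but the control flow can be sketched as a breadth-first tree search; in this hypothetical sketch, `expand` stands in for sampling continuations from an LLM and `score` for an LLM-based (or rule-based) branch evaluator:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                        # partial reasoning trace so far
    children: list = field(default_factory=list)

def expand(node: Node, branch_factor: int) -> list[Node]:
    """Generate candidate continuations (stand-in for LLM sampling)."""
    return [Node(node.state + f" -> step{i}") for i in range(branch_factor)]

def score(node: Node) -> float:
    """Heuristic value of a partial trace (stand-in for a branch evaluator)."""
    return -float(len(node.state))    # toy heuristic: prefer shorter traces

def tot_search(root: Node, depth: int, branch_factor: int = 3, keep: int = 2) -> list[Node]:
    """Breadth-first ToT: expand every frontier node in parallel,
    then prune the frontier down to the `keep` best branches."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for node in frontier
                      for child in expand(node, branch_factor)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:keep]  # prune unproductive paths
    return frontier

best = tot_search(Node("start"), depth=3)
print([n.state for n in best])
```

The pruning step is what distinguishes this from plain best-of-n sampling: token budget is spent only on branches that survive evaluation, which is the efficiency argument the abstract makes against long sequential CoT.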
Problem

Research questions and friction points this paper is trying to address.

Enhance LLM reasoning by replacing chain-of-thought with tree-of-thoughts
Reduce verbose outputs and improve efficiency in LLM reasoning
Train LLMs using puzzle games to develop parallel reasoning strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses tree-of-thoughts RL for parallel reasoning
Integrates puzzle games for ToT training
Combines rule-based rewards with on-policy RL
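A rule-based reward means correctness is checked programmatically against the puzzle's known solution, with no learned reward model. The paper does not specify its answer format or reward shaping; this minimal sketch assumes a hypothetical `<answer>...</answer>` tag convention and illustrative reward values:

```python
import re

def rule_based_reward(completion: str, solution: str) -> float:
    """Hypothetical rule-based reward for puzzle solving:
    +1.0 for a verifiably correct final answer,
    -0.5 for a well-formed but wrong answer,
    -1.0 for a malformed output with no answer tag."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return -1.0
    answer = match.group(1).strip()
    return 1.0 if answer == solution else -0.5

print(rule_based_reward("branch A fails... <answer>7</answer>", "7"))
```

Because the reward is computed by string matching rather than by a model, it is cheap, deterministic, and immune to reward hacking through persuasive but wrong reasoning, which is why verifiable puzzle games suit on-policy RL training.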