🤖 AI Summary
Large language models (LLMs) often reason too shallowly and fail to accumulate trial-and-error experience when tackling complex problems such as mathematical reasoning. To address this, we propose Boosting of Thoughts (BoT), a framework in which an LLM autonomously constructs and iteratively explores many tree-structured reasoning paths without human-provided examples. BoT combines self-evaluation scoring with prompt refinement guided by the model’s own error analysis, so that experience from earlier attempts feeds back into later prompts. Its core contribution is pairing error-driven prompt revision with the exploration of many thought trees, moving from single-pass inference to experience-augmented reasoning. Experiments with GPT-4 and Llama2 on complex mathematical problems show that BoT achieves problem-solving rates higher than or comparable to those of other advanced prompting approaches.
📝 Abstract
The reasoning performance of Large Language Models (LLMs) on a wide range of problems critically relies on chain-of-thought prompting, which involves providing a few chain-of-thought demonstrations as exemplars in prompts. Recent work, e.g., Tree of Thoughts, has pointed out the importance of exploration and self-evaluation in reasoning step selection for complex problem solving. In this paper, we present Boosting of Thoughts (BoT), an automated prompting framework for problem solving with LLMs that iteratively explores and self-evaluates many trees of thoughts in order to acquire an ensemble of trial-and-error reasoning experiences, which then serves as a new form of prompting for solving the complex problem. Starting from a simple prompt that requires no examples, BoT iteratively explores and evaluates a large collection of reasoning steps and, more importantly, uses the LLM's error analysis of those steps to explicitly revise the prompt, which in turn improves reasoning step generation, until a final answer is attained. Our experiments with GPT-4 and Llama2 across a broad set of complex mathematical problems demonstrate that BoT consistently achieves problem-solving rates higher than or comparable to those of other advanced prompting approaches.
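The loop described in the abstract can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the `LLM` callable, all function names, the prompt wording, and the 0-to-1 scoring scale are assumptions made for the example, and a greedy step-by-step walk stands in for full tree search.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption): prompt text in, generated text out.
LLM = Callable[[str], str]


def parse_score(text: str) -> float:
    """Read a numeric self-evaluation; unparseable replies score 0."""
    try:
        return float(text.strip())
    except ValueError:
        return 0.0


def build_prompt(problem: str, experience: List[str]) -> str:
    """Start from a bare prompt and append accumulated error analyses."""
    prompt = f"Problem: {problem}"
    if experience:
        prompt += "\nPrevious attempts and error analysis:\n" + "\n".join(experience)
    return prompt


def explore_chain(llm: LLM, prompt: str, depth: int, width: int) -> List[str]:
    """Greedy stand-in for tree search: keep the best-scored step per level."""
    chain: List[str] = []
    for _ in range(depth):
        candidates = [
            llm(f"{prompt}\nSteps so far: {chain}\nPropose next step #{i}.")
            for i in range(width)
        ]

        def step_score(step: str) -> float:
            # The model evaluates its own candidate step.
            reply = llm(f"{prompt}\nSteps so far: {chain}\nScore this step from 0 to 1: {step}")
            return parse_score(reply)

        chain.append(max(candidates, key=step_score))
    return chain


def boosting_of_thoughts(
    llm: LLM, problem: str, iterations: int = 3, trees: int = 2,
    depth: int = 3, width: int = 2,
) -> Tuple[List[str], List[str]]:
    """Explore several chains per round, then fold error analysis into the prompt."""
    experience: List[str] = []
    best: List[str] = []
    for _ in range(iterations):
        prompt = build_prompt(problem, experience)
        chains = [explore_chain(llm, prompt, depth, width) for _ in range(trees)]
        # Pick the chain the model itself rates highest.
        best = max(
            chains,
            key=lambda c: parse_score(llm(f"{prompt}\nScore this chain from 0 to 1: {c}")),
        )
        # Ask the model to analyze errors; the analysis revises the next prompt.
        feedback = llm(f"{prompt}\nAnalyze the errors in this reasoning chain: {best}")
        experience.append(f"Attempt: {best}\nError analysis: {feedback}")
    return best, experience
```

The point of the sketch is the feedback path: each round's error analysis is appended to `experience`, so the prompt used in the next round already carries what went wrong before, and no hand-written exemplars are needed at the start.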